SQL Server : remove duplicated text within a string - sql

I have a SQL Server 2008 table with a column containing lengthy HTML text. Near the top there is a link provided for an associated MP3 file which is unique to each record. The links are are all formatted as follows:
<div class="MediaSaveAs">Download Audio </div>
Unfortunately many records contain two or three sequential and identical instances of this link where there should be only one. Is there a relatively simple script I can run to find and eliminate the redundant links?

I'm not entirely sure - because your explanation wasn't very clear - but this appears to do what you want, although whether or not you consider this to be a "simple script", I don't know.
declare #Link nvarchar(200) = N'<div class="MediaSaveAs">Download Audio </div>'
declare #BadData nvarchar(max) = N'cbjahcgfhjasgfzhjaucv' + replicate(#Link, 3) + N'cabhjcsghagj',
#StartPattern nvarchar(34) = N'<div class="MediaSaveAs"><a href="',
#EndPattern nvarchar(27) = N'">Download Audio </a></div>'
select #BadData
select replace (
#BadData,
substring(#BadData, charindex(#StartPattern, #BadData), len(#BadData)-charindex(reverse(#EndPattern), reverse(#BadData))-charindex(#StartPattern, #BadData) + 2),
substring(#BadData, charindex(#StartPattern, #BadData), charindex(#EndPattern, #BadData) + len(#EndPattern) - charindex(#StartPattern, #BadData))
)
Personally I would not like to have to maintain this code; I would far rather use a script in another language that can actually parse HTML. You said this is "just a repeated text issue", but that doesn't mean it's an easy problem and especially not in a language like TSQL that has such limited support for string operations.
For future reference, please put all relevant information into the question - you can edit it if you need to - instead of leaving them in the comments where they are difficult to read and may be overlooked. And please post sample data and results instead of describing things in words.

First we need to identify the file names, which we can do with PATINDEX:
select
substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
from files
And then secondly identify and the duplicates, check it out:
delete
from files
where id not in (
select max(id)
from files
group by substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
)
http://www.sqlfiddle.com/#!3/887a3/5

Related

Is there a more efficient way to parse a fixed txt file in Access than using queries?

I have a few large fixed with text files that have multiple specification formats in them. I need to parse out the txt files based on a character with a set location in the file. That character can have a different position in the file.
I have written queries for each of the different specifications (95 of them) with the start position and length hard coded into the query using the mid() function with a WHERE() function to filter the [Record Identifier] from the specification. As you can see below the 2 specifications in the WHERE() function have different placements in the txt file.
\\\
SELECT Mid([AllData],1,5) AS PlanNumber, Mid([AllData],6,4) AS Spaces1, Mid([AllData],10,3) AS Filler1, Mid([AllData],13,11) AS SSN, Mid([AllData],24,1) AS AccountIdentifier, Mid([AllData],25,5) AS Filler2, Mid([AllData],30,2) AS RecordIdentifier, Mid([AllData],32,1) AS FieldType, Mid([AllData],33,4) AS Filler3, Mid([AllData],37,8) AS HireDate, Mid([AllData],45,8) AS ParticipationDate, Mid([AllData],53,8) AS VestinDate, Mid([AllData],61,8) AS DateOfBirth, Mid([AllData],77,1) AS Spaces2, Mid([AllData],78,1) AS Reserved1, Mid([AllData],79,1) AS Reserved2, Mid([AllData],80,1) AS Spaces3
FROM TBL_Company1
WHERE (((Mid([AllData],30,2))="02") AND ((Mid([AllData],32,1))="D"));
\\\
Or
\\\
SELECT Mid([AllData],1,5) AS PlanNumber, Mid([AllData],6,4) AS Spaces1, Mid([AllData],10,3) AS Filler1, Mid([AllData],13,11) AS SSN, Mid([AllData],24,1) AS AccountIdentifier, Mid([AllData],25,7) AS RecordIdentifier, Mid([AllData],32,22) AS StreetAddressForBank, Mid([AllData],54,20) AS CityForBank, Mid([AllData],74,2) AS StateForBank, Mid([AllData],76,5) AS ZipCodeForBank
FROM TBL_Company1
WHERE (((Mid([AllData],25,7))="49EFTAD"));
\\\
Is there a way to Parse out this without having to hard code every position and length into the code?
I was thinking of having a table with all of the specifications in it and have an import function look to the specification table and parse out the data accordingly to a new table or maybe something else.
What I have done is not very scalable and if the format changes a little I would have to go back to each query to change it.
Any Help is greatly appreciated
I think in your situation, I'd want to be able to generate the SQL statement dynamically, as you suggest.
I'd have a table something like:
Format#,Position,OutColName,FromPos,Length,WhereValue
1,1,"PlanNumber",1,5,
1,2,"Spaces1",6,4,
...
1,n,,30,2,"02"
1,n+1,,32,1"D"
and then some VBA to process it and build and execute the SQL string(s). The SELECT clause entries would be recognized by having a value in the OutColName field and WHERE clause entries by values in the the WhereValue column.
Of course this is only more "efficient" in the sense that it's a bit easier to code up new formats or fix/modify existing ones.

How to make LIKE in SQL look for specific string instead of just a wildcard

My SQL Query:
SELECT
[content_id] AS [LinkID]
, dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name'))) AS [Physician Name]
FROM
[DB].[dbo].[table1]
WHERE
[id] = '188'
AND
(content LIKE '%Urology%')
AND
(contentS = 'A')
ORDER BY
--[content_title]
dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name')))
The issue I am having is, if the content is Neurology or Urology it appears in the result.
Is there any way to make it so that if it's Urology, it will only give Urology result and if it's Neurology, it will only give Neurology result.
It can be Urology, Neurology, Internal Medicine, etc. etc... So the two above used are what is causing the issue.
The content is a ntext column with XML tag inside, for example:
<root><Location><location>Office</location>
<office>Office</office>
<Address><image><img src="Rd.jpg?n=7513" /></image>
<Address1>1 Road</Address1>
<Address2></Address2>
<City>Qns</City>
<State>NY</State>
<zip>14404</zip>
<phone>324-324-2342</phone>
<fax></fax>
<general></general>
<from_north></from_north>
<from_south></from_south>
<from_west></from_west>
<from_east></from_east>
<from_connecticut></from_connecticut>
<public_trans></public_trans>
</Address>
</Location>
</root>
With the update this content column has the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Physicians>
<name>Doctor #1</name>
<picture>
<img src="phys_lab coat_gradation2.jpg?n=7529" />
</picture>
<gender>M</gender>
<langF1>
English
</langF1>
<specialty>
<a title="Neurology" href="neu.aspx">Neurology</a>
</specialty>
</Physicians>
</root>
If I search for Lab the result appears because there is the text lab in the column.
This is what I would do if you're not into making a CLR proc to use Regexes (SQL Server doesn't have regex capabilities natively)
SELECT
[...]
WHERE
(content LIKE #strService OR
content LIKE '%[^a-z]' + #strService + '[^a-z]%' OR
content LIKE #strService + '[^a-z]%' OR
content LIKE '%[^a-z]' + #strService)
This way you check to see if content is equal to #strService OR if the word exists somewhere within content with non-letters around it OR if it's at the very beginning or very end of content with a non-letter either following or preceding respectively.
[^...] means "a character that is none of these". If there are other characters you don't want to accept before or after the search query, put them in every 4 of the square brackets (after the ^!). For instance [^a-zA-Z_].
As I see it, your options are to either:
Create a function that processes a string and finds a whole match inside it
Create a CLR extension that allows you to call .NET code and leverage the REGEX capabilities of .NET
Aaron's suggestion is a good one IF you can know up front all the terms that could be used for searching. The problem I could see is if someone searches for a specific word combination.
Databases are notoriously bad at semantics (i.e. they don't understand the concept of neurology or urology - everything is just a string of characters).
The best solution would be to create a table which defines the terms (two columns, PK and the name of the term).
The query is then a join:
join table1.term_id = terms.term_id and terms.term = 'Urology'
That way, you can avoid the LIKE and search for specific results.
If you can't do this, then SQL is probably the wrong tool. Use LIKE to get a set of results which match and then, in an imperative programming language, clean those results from unwanted ones.
Judging from your content, can you not leverage the fact that there are quotes in the string you're searching for?
SELECT
[...]
WHERE
(content LIKE '%""Urology""%')

SQL Full Text Contains not returning expected rows

I have a Body column that is full text indexed and is nvarchar(max)
One row has this in the Body column
You want slighty mad this sat the 60th runing of the 3peaks race! Peny-ghent whernside and inglbauher! Only in yorkshire!
If I run: select body from messages where CONTAINS(Body,'you') it doesn't return any data.
If I run the below adding wildcards select messageid,body from messages where CONTAINS(body,'"*you*"') it still doesnt return the data.
Can you help me understand what's going on please?
Thanks
UPDATE : It makes no difference if its you or You, either way no results
It can be case sensitivity issue. Try with select messageid,body from messages where CONTAINS(body,'"*You*"') and see if you are getting the result or not
A full text catalog has a set of words in a “stoplist” that it won’t search on as SQL Server considers them “unimportant for search purposes”
To get this you can run
select ssw.*
from sys.fulltext_system_stopwords ssw
where ssw.language_id = 1033;
Below are the words it won’t search on and you’ll see it contains “you” hence why it didn’t find my data.

Modify a column, to get rid of html surrounding an ID

I have a table and one of the columns contains html for an iFrame & within it an external video, specifically it's like
<iframe src="http://host.com/videos/ID" otherattributes...></iframe>.
I need to update the current column or create a new one (doesn't matter) so what I have is just the ID of that video, I know I could use a regex for it but I'm really weak with it.
perhaps so it find the content that is within literal characters: [videos/] and the upcoming ["] which comes right after the ID but I'm unsure how.
You can use CHARINDEX() function:
update T SET
VideoID=SUBSTRING(descr,
charindex('/videos/',descr)+LEN('/videos/'),
charindex('"',descr,charindex('/videos/',descr)+LEN('/videos/'))
-(charindex('/videos/',descr)+LEN('/videos/')))
SQLFiddle demo
This should work, assuming the text videos/ doesn't appear anywhere else in the html.
update htmltable
set id = SUBSTRING(SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
),
0,
CHARINDEX('"', SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
)
)
)
This updates a field named otherfield in table htmltable where the id in the url is '123'. It's pretty ugly code, but SQL Server has limited string functions.
If you have any control over the table structure, I would suggest you make some changes. The video ID should be stored in its own column, separate from the rest of the url. Then when you need to retrieve the url, you would concatenate the two parts to get the whole url. That would be much more maintainable.

MOSS 2007: What is the source of "Directories"?

I'm trying to generate a new SharePoint list item directly using SQL server. What's stopping me is damn tp_DirName column. I have no ideas how to create this value.
Just for instance, I have selected all tasks from AllUserData, and there are possible values for the column: 'MySite/Lists/Task', 'Lists/Task' and even 'MySite/Lists/List2'.
MySite is the FullUrl value from Webs table. I can obtain it. But what about 'Lists/Task' and '/Lists/List2'? Where they are stored?
If try to avoid SQL context, I can formulate it the following way: what is the object, that has such attribute as '/Lists/List2'? Where can I set it up in GUI?
Just a FYI. It is VERY not supported to try and write directly to SharePoint's SQL Tables. You should really try and write something that utilizes the SharePoint Object Model. Writing to the SharePoint database directly mean Microsoft will not support the environment.
I've discovered, that [AllDocs] table, in contrast to its title, contains information about "directories", that can be used to generate tp_DirName. At least, I've found "List2" and "Task" entries in [AllDocs].[tp_Leaf] column.
So the solution looks like this -- concatenate the following 2 components to get tp_DirName:
[Webs].[FullUrl] for the web, containing list, containing item.
[AllDocs].[tp_Leaf] for the list, containing item.
Concatenate the following 2 components to get tp_Leaf for an item:
(Item count in the list) + 1
'_.000'
Regards,
Well, my previous answer was not very useful, though it had a key to the magic. Now I have a really useful one.
Whatever they said, M$ is very liberal to the MOSS DB hackers. At least they provide the following documents:
http://msdn.microsoft.com/en-us/library/dd304112(PROT.13).aspx
http://msdn.microsoft.com/en-us/library/dd358577(v=PROT.13).aspx
Read? Then, you know that all folders are listed in the [AllDocs] table with '1' in the 'Type' column.
Now, let's look at 'tp_RootFolder' column in AllLists. It looks like a folder id, doesn't it? So, just SELECT the single row from the [AllDocs], where Id = tp_RootFolder and Type = 1. Then, concatenate DirName + LeafName, and you will know, what the 'tp_DirName' value for a newly generated item in the list should be. That looks like a solid rock solution.
Now about tp_LeafName for the new items. Before, I wrote that the answer is (Item count in the list) + 1 + '_.000', that corresponds to the following query:
DECLARE #itemscount int;
SELECT #itemscount = COUNT(*) FROM [dbo].[AllUserData] WHERE [tp_ListId] = '...my list id...';
INSERT INTO [AllUserData] (tp_LeafName, ...) VALUES(CAST(#itemscount + 1 AS NVARCHAR(255)) + '_.000', ...)
Thus, I have to say I'm not sure that it works always. For items - yes, but for docs... I'll inquire into the question. Leave a comment if you want to read a report.
Hehe, there is a stored procedure named proc_AddListItem. I was almost right. MS people do the same, but instead of (count + 1) they use just... tp_ID :)
Anyway, now I know THE SINGLE RIGHT answer: I have to call proc_AddListItem.
UPDATE: Don't forget to present the data from the [AllUserData] table as a new item in [AllDocs] (just insert id and leafname, see how SP does it itself).