Split table data in SQL and replace with results - sql

I need to remove a lot of unneeded data from each row of a table, based on split parameters.
My SQL table stores a lot of HTML for caching. The data is already in SQL and is growing quite large, so I want to cut the parts I don't use out of each row, based on a string, and update the table with the result.
The cacheHTML table holds data like this:
<html>
... (a bunch of data I don't need)
<first div>
... (the data I do want to save)
</div>
... (data I don't care about also)
</html>
I only want what's inside the first div; all the other HTML should be removed.
Is there an easy way to do this? I need to do it for 5k rows of cached data...
I need a function or method that gives me everything between string1 and string2, then replaces the row's contents with the result. Any help would be appreciated. Thanks!

You could do something like this. It only works if you always need the text inside the first div in the HTML string, and if that div is written exactly as <div> with no attributes. I'm assuming SQL Server as the database system, but it could probably be translated to others fairly easily.
Sample HTML string:
<html>
<head>
<title>Stuff i dont need</title>
</head>
<body>
<h1>Stuff i dont need</h1>
<p>I dont need any of this data</p>
<div>This is the data i need to save!</div>
<h3>Dont need this</h3>
<div>Wont need this either!</div>
<h3>Bye</h3>
</body>
</html>
SQL to do the update:
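UPDATE cacheHTML
SET htmlText = REPLACE(SUBSTRING(htmlText, CHARINDEX('<div>', htmlText), CHARINDEX('</div>', htmlText) - CHARINDEX('<div>', htmlText)), '<div>', '')
-- Skip rows without a <div>...</div> pair; otherwise they would be blanked or fail with a negative length.
WHERE CHARINDEX('<div>', htmlText) > 0 AND CHARINDEX('</div>', htmlText) > CHARINDEX('<div>', htmlText)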
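The "everything between string1 and string2" part can also be done in application code if the string functions of your database get awkward. Below is a minimal sketch in Python, not a definitive implementation. The helper name between, the id column, and the in-memory SQLite table standing in for the real cacheHTML table are all assumptions made for illustration.

```python
import sqlite3


def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)  # look for the end marker only after the start marker
    if j == -1:
        return None
    return text[i:j]


# In-memory stand-in for the real table (the id column is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cacheHTML (id INTEGER PRIMARY KEY, htmlText TEXT)")
conn.executemany(
    "INSERT INTO cacheHTML (htmlText) VALUES (?)",
    [
        ("<html><h1>skip</h1><div>keep me</div><div>skip too</div></html>",),
        ("<html><p>no div here</p></html>",),
    ],
)

# Read every row, trim it in Python, and write back only the rows that matched.
rows = conn.execute("SELECT id, htmlText FROM cacheHTML").fetchall()
for row_id, html in rows:
    inner = between(html, "<div>", "</div>")
    if inner is not None:
        conn.execute(
            "UPDATE cacheHTML SET htmlText = ? WHERE id = ?", (inner, row_id)
        )
conn.commit()

result = [r[0] for r in conn.execute("SELECT htmlText FROM cacheHTML ORDER BY id")]
print(result)
```

Rows with no matching pair are left untouched rather than blanked, which is probably what you want for a one-off cleanup of 5k rows.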

Related

How to do a full text search, when content has HTML tags?

I want to make a full-text search using text content in HTML format. E.g.:
... f<em>oo</em> ...
If the search term is 'foo', the document containing f<em>oo</em> is not found.
How can I make this work?
I'm using PostgreSQL.
I'm afraid you won't find anything out of the box, but you could extract the text from your html column, e.g. using regular expressions or xpath (designed for XML):
CREATE TABLE t (html text);
INSERT INTO t VALUES ('<html>
<h1>
<foo>test f<em>oo</em> bar</foo>
</h1>
<h1>
<foo>test bar</foo>
</h1>
</html>');
SELECT * FROM t
WHERE to_tsvector(regexp_replace(html,'<.+?>','','g')) @@ plainto_tsquery('test foo');
html
--------------------------------------
<html> +
<h1> +
<foo>test f<em>oo</em> bar</foo>+
</h1> +
<h1> +
<foo>test bar</foo> +
</h1> +
</html>
Keep in mind that building the tsvector at query time will make things quite slow. So, if you decide to go this way, consider storing it in a new column and creating a GIN index on that column, e.g.
CREATE INDEX idx_html_tsvector ON t USING gin(the_new_column);
The PostgreSQL text parser will recognize tags but treats them as word boundaries, so it will parse that as 'f' and 'oo'. This cannot be changed without hacking the C code. You could implement your own parser, but again in C, and doing so is not easy. Your best bet is probably to pre-process your text with something else to remove the tags, making sure that this closes up the gap to give 'foo', not 'f oo'.
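As a rough sketch of that pre-processing step, here is one way to drop the tags in Python before the text reaches PostgreSQL. The class name TextOnly and the function name strip_tags are made up for this example. It uses the standard library's html.parser and joins the text pieces with no separator, so that f<em>oo</em> closes up to foo.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    # Join with no separator so inline tags do not split a word in two.
    return "".join(parser.parts)


print(strip_tags("<foo>test f<em>oo</em> bar</foo>"))
```

The flip side of joining without a separator is that two block elements with no whitespace between them would have their words run together, so you may want to treat block-level tags differently if your HTML is minified.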

Extract data from XML string in Hive Table without using XPath

I am trying to use a view to extract a string(value) from a large XML string that sits in a single column in a hive table. I need to get the associated FOO_STRING_VALUE for COMPANY_ID, SALE_IND, and CLOSING_IND.
<Message>
<Header>
<FOO_STRING>
<FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
<FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
</Header>
</Message>
The XML file can have up to 50 "FOO_STRINGS" and there is no guarantee in what order they will be in so I can not use XPATH unless I have 50 xpath_string calls for each Name/Value pair and matched them up later. I am using xpath like this .....
xpath_string(xml_txt, '/Message/Header/FOO_STRING[1]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[2]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[3]/FOO_STRING_VALUE') AS String_Val_3
However, if the order changes then it doesn't work. I'm wondering if there is a quick way to find the FOO_STRING_NAME I need and get the corresponding value using regexp_extract() or some other way? I am not familiar with regex, so any help or suggestions would be appreciated. Thanks a ton.
"if the order changes then it doesn't work"
Don't use position, then.
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS String_Val_3
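To illustrate why matching on the name is order-independent, here is a small Python sketch outside Hive, using the standard library's xml.etree.ElementTree. The sample document is the one from the question with the entries deliberately shuffled; the variable names are just for this example.

```python
import xml.etree.ElementTree as ET

# Same structure as in the question, but in a different order.
xml_txt = """
<Message>
  <Header>
    <FOO_STRING>
      <FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
      <FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
    </FOO_STRING>
    <FOO_STRING>
      <FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
      <FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
    </FOO_STRING>
    <FOO_STRING>
      <FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
      <FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
    </FOO_STRING>
  </Header>
</Message>
"""

root = ET.fromstring(xml_txt)

# Same idea as the predicate above: select the FOO_STRING by its name child.
company_id = root.findtext(
    "./Header/FOO_STRING[FOO_STRING_NAME='COMPANY_ID']/FOO_STRING_VALUE"
)

# Or collect every name/value pair once and look names up afterwards.
values = {
    item.findtext("FOO_STRING_NAME"): item.findtext("FOO_STRING_VALUE")
    for item in root.iterfind("./Header/FOO_STRING")
}

print(company_id, values["SALE_IND"], values["CLOSING_IND"])
```

Either way, the position of an entry inside Header no longer matters, which is the point of dropping [1], [2], [3].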

Remove HTML Tags from a text using SQL

I have a column in a table which contains HTML text (data containing HTML tags) as well as normal text.
I need to remove the HTML tags from the data wherever they exist.
Steps I planned:
1. Filter only the records which contain HTML tags. --> I am able to complete this step. My logic: where HTMLStirng like('<%>%')
2. Replace the HTML tags with an empty string. --> I am trying to apply the replace function, but I am not able to.
For Example:
<p>Paragraph</p>
<b>bold</b><I>Italic</I>
Normal Text
My output should be:
Paragraph
boldItalic
Normal Text
Can someone help me with step 2?
If you are using Oracle, try the following:
SELECT Regexp_replace(your_column_name, '<.+?>')
FROM your_table_name;
Example
SELECT Regexp_replace('<b>bold</b><I>Italic</I> Testing', '<.+?>')
FROM dual;
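If you want to try the same pattern outside the database first, this is a small Python sketch using the standard re module. The function name remove_tags is made up for the example; the pattern is the one from the answer.

```python
import re

# Non-greedy: match from a '<' to the nearest following '>'.
TAG = re.compile(r"<.+?>")


def remove_tags(text):
    """Replace every <...> run with an empty string."""
    return TAG.sub("", text)


for sample in ["<p>Paragraph</p>", "<b>bold</b><I>Italic</I>", "Normal Text"]:
    print(remove_tags(sample))
```

Be aware that a pattern like this cannot tell a tag from ordinary text that happens to contain a < followed later by a >, so it is worth checking a sample of your rows before running an update.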

WebMatrix Nested WebGrid VB code

I have looked at the code in both the following posts:
formatting in razor nested webgrid, replied to by nemesv in October 2011 and
Razor Nested WebGrid, replied to by Chad Moran in April 2011.
They both seem to be close to my problem but the code is C# based, I believe, and I am having difficulty converting it to VB. I am also not sure they are exactly where I am at. I am particularly bemused by the following line, because of the two equals signs and double reference to subGrid.
WebGrid subGrid = subGrid = new WebGrid(item.SubItems)
I am also not sure whether topGrid and subGrid are just generic names, used for the purpose of illustration, or whether they are key words.
As a very relevant aside I will mention that this point in my web page project has held me up for five years now (I am not exaggerating - I just stopped working on the project for two years because of it). I have tried using ASP in VWD and now Grid View in WebMatrix and I hope I will not fail again.
Database record
Fields: Publisher_Name, Publisher_City, Series_Published, No_of_Series
Record Example: Price Stern Sloan, Baltimore, JKLMNO, 6
My two planned grid names
Publishers_Grid (top)
Series_Grid (sub)
What I am trying to do
For each of the characters in the string JKLMNO, access a second table, where each letter is the primary key for a record in that table.
Retrieve the value of the field, Back_Cover_Image, in that second table, which will be the file name or, at least, the unique part of the file name, for the image to be displayed.
If I go with the partial unique bit of file name approach, build the full file names for the images. And then -
Display as a second web grid row, the images thus pointed at, in the record example, that would be 6 images.
Thus I would end up, for the example record, something like the following (I have used XX to stand for an image): -
Price Stern Sloan Baltimore XX XX XX XX XX XX
I certainly hope I am not wasting the valuable time of experts whom I greatly admire. I'm just trying to achieve something that seems quite simple to me, having originally been a PL/1 programmer 30 years ago and a great user of nested arrays within that language, but I just can't work out the syntax in VB, Razor and WebMatrix.
I look forward to some constructive answers, and please do use VB.
My WebMatrix page so far
@Code
Layout = "~/Shared/Layouts/_Layout.vbhtml"
Dim HWB_Database As Database = Database.Open("How_and_Why_Wonder_Books")
Dim HWB_Publishers_All_sqlCommand = "SELECT * FROM Publishers ORDER BY Publisher_Code"
Dim Publishers_Data = HWB_Database.Query(HWB_Publishers_All_sqlCommand)
Dim Publishers_Grid = New WebGrid(Publishers_Data)
End Code
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>How and Why Wonder Books - Publishers</title>
</head>
<body>
<div style="margin-left: 100px">
<p style="Width: 1020px; border-width: 1px" class="InstructionsHeader">
Click on a publisher to see the list of titles produced under that imprint, click on a thumbnail to see details of that series type.
</p>
</div>
<br>
<br>
<div id="Publishers_Grid_Display">
@Publishers_Grid.GetHtml(columns:= Publishers_Grid.Columns(
Publishers_Grid.Column("Publisher_Name"),
Publishers_Grid.Column("Place_of_Publication"),
Publishers_Grid.Column("Series_Published")
)
)
</div>
Thank you for your promptings and encouragement.

Need help for complicated sql update

I have a table with many records. I store HTML in a field of that table called Data. The HTML in each record has many IMG tags like <img src='test.gif' />. A sample page is at http://www.bba-reman.com/content.aspx?content=bba_reman_diagnostics_tools
That page shows many product images, and all of its data comes from the table. I want to use the lazyload jQuery plugin, and for that each IMG tag should look like <img src="img/grey.gif" data-original="img/example.jpg" >, so I need to update the HTML stored in my table.
I need to write a SQL UPDATE statement that goes through the HTML in every row, finds the IMG tags inside a particular div (identified by its ID), sets src to the fixed value src="img/grey.gif" for all images, and adds an attribute like data-original="img/example.jpg" to each IMG tag.
I know this is an awkward case for an UPDATE statement. Please suggest a good way to update all the IMG tags with SQL. Thanks.
Assuming that all your tags do actually end in />, and that nothing earlier in the value (such as the src path) contains a slash, this would work:
UPDATE myTable
SET tag = LEFT(tag, CHARINDEX('/', tag) - 1) + 'data-original=''example.gif'' />'
However, that wouldn't change the quotes or remove the closing slash before the end of the tag, as you have done in your question. It also leaves the src value unchanged.
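If doing this in pure SQL turns out to be too fiddly, the rewrite could be done in application code and the rows written back afterwards. Here is a rough Python sketch of the tag rewrite only. It assumes src is the first attribute of every img tag and that it is quoted; the function name lazify and the default placeholder path are hypothetical, and it rewrites every img tag it sees rather than only those inside a particular div.

```python
import re

# Matches '<img src=' followed by a quoted value (single or double quotes).
IMG_SRC = re.compile(r"""<img\s+src=(['"])(.*?)\1""", re.IGNORECASE)


def lazify(html, placeholder="img/grey.gif"):
    """Move the real src into data-original and point src at a placeholder."""

    def rewrite(match):
        original = match.group(2)  # the original src value, without quotes
        return f'<img src="{placeholder}" data-original="{original}"'

    return IMG_SRC.sub(rewrite, html)


print(lazify("<div id='products'><img src='test.gif' /></div>"))
```

Whatever follows the src attribute in the original tag, including the closing slash, is left exactly as it was.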