Remove HTML Tags from a text using SQL

I have a column in a table that contains HTML text (data containing HTML tags) as well as plain text.
I need to remove the HTML tags from the data wherever they exist.
Steps I planned:
1. Filter only the records that contain HTML tags. --> I was able to complete this step. My logic: where HTMLStirng like('<%>%')
2. Replace the HTML tags with nothing (an empty string). --> I am trying to apply the replace function, but I have not been able to.
For example:
<p>Paragraph</p>
<b>bold</b><I>Italic</I>
Normal Text
My output should be:
Paragraph
boldItalic
Normal Text
Can someone help me with step 2?

If you are using Oracle, try the following (REGEXP_REPLACE removes each match when no replacement string is given):
SELECT Regexp_replace(your_column_name, '<.+?>')
FROM your_table_name;
Example:
SELECT Regexp_replace('<b>bold</b><I>Italic</I> Testing', '<.+?>')
FROM dual;
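The pattern can also be checked outside the database. The following is a minimal sketch in Python, assuming Python's `re` module as a stand-in for Oracle's regex engine (the two differ in details). It uses `<[^>]+>` instead of `<.+?>` so that a tag containing a line break is matched as well. The name `strip_tags` is hypothetical.

```python
import re

# Matches "<", then one or more characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove anything that looks like an HTML tag from text."""
    return TAG_PATTERN.sub("", text)

rows = ["<p>Paragraph</p>", "<b>bold</b><I>Italic</I>", "Normal Text"]
cleaned = [strip_tags(row) for row in rows]
print(cleaned)  # ['Paragraph', 'boldItalic', 'Normal Text']
```

Note that a pattern like this is purely textual: plain text such as "a < b and c > d" would lose the part between the angle brackets, so it is only safe when the column really holds markup.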

Related

How to do a full text search, when content has HTML tags?

I want to make a full-text search using text content in HTML format. E.g.:
... f<em>oo</em> ...
If the search term will be 'foo' the document containing "foo" will not be found
How to make this work?
I'm using PostgreSQL

I'm afraid you won't find anything out of the box, but you could extract text from your html column, e.g. using regular expressions or xpath (designed for xml)
CREATE TABLE t (html text);
INSERT INTO t VALUES ('<html>
<h1>
<foo>test f<em>oo</em> bar</foo>
</h1>
<h1>
<foo>test bar</foo>
</h1>
</html>');
SELECT * FROM t
WHERE to_tsvector(regexp_replace(html,'<.+?>','','g')) @@ plainto_tsquery('test foo');
html
--------------------------------------
<html> +
<h1> +
<foo>test f<em>oo</em> bar</foo>+
</h1> +
<h1> +
<foo>test bar</foo> +
</h1> +
</html>
Keep in mind that creating the tsvector at query time will make things quite slow. So, if you decide to go this way, consider creating a new column for it and then creating a GIN index on that column, e.g.
CREATE INDEX idx_html_tsvector ON t USING gin(the_new_column);
Demo: db<>fiddle

The PostgreSQL text parser will recognize tags but treats them as word boundaries, so it will parse that as 'f' and 'oo'. This cannot be changed without hacking the C code. You could implement your own parser, but again in C, and doing so is not easy. Your best bet is probably to pre-process your text with something else to remove the tags, making sure that it closes up the gap to give 'foo', not 'f oo'.
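As one possible way to do that pre-processing outside the database, here is a minimal sketch using the `HTMLParser` class from Python's standard library. The class and function names (`TextExtractor`, `extract_text`) are hypothetical, and the sketch assumes the cleaned text would then be stored in a separate column for indexing.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags between them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join with no separator so "f<em>oo</em>" becomes "foo", not "f oo".
    return "".join(parser.parts)

print(extract_text("<foo>test f<em>oo</em> bar</foo>"))  # test foo bar
```

Joining the pieces with an empty string is the step that closes the gap. Joining with a space would reproduce the 'f oo' problem described above.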

How to add xml data(containing tags) in extent-report.html

I'm trying to add an XML string to my extent report. Instead, it is treated as HTML tags and ends up in the DOM rather than being displayed in the UI.
I tried adding characters before and after the XML string; the characters are printed, but the XML data is not.
Reporter.addStepLog("text is <"+"<response><abc value=10></abc></response>"+">");
I want the report to show the output as -> text is <<response><abc value=10></abc></response>>
but I am getting -> text is <>
P.S.: If you look at the console, you will see what I am trying to explain!

You could try wrapping the XML in the following HTML element:
Reporter.addStepLog("text is <textarea rows='20' cols='40' style='border:none;'>"+"<response><abc value=10></abc></response>"+"</textarea>");
For information, I found this trick on this site:
Display XML content in HTML page
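Another option, not part of the answer above, is to escape the markup characters so the browser shows them as text. The sketch below illustrates the idea with Python's standard `html.escape`; it assumes the report renders the logged string as HTML, which is what the behaviour in the question suggests.

```python
import html

xml = "<response><abc value=10></abc></response>"

# html.escape replaces &, < and > (and quotes by default) with HTML entities,
# so the browser displays the characters instead of parsing them as tags.
escaped = html.escape(xml)
message = "text is " + escaped
print(message)
# text is &lt;response&gt;&lt;abc value=10&gt;&lt;/abc&gt;&lt;/response&gt;
```

The same replacements could presumably be done in Java, for example with `String.replace`, before the string is passed to `Reporter.addStepLog`.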

regexp_replace limit the number of characters displayed before / after each occurrence

I created a simple word search in a web application that goes through all documents stored in our Oracle 12c database and displays links to those documents that contain the specific word. In addition it orders the list of documents based on the number of occurrences in each document.
The documents are simple html formatted texts stored in NCLOB datatype column, first few lines:
<div class="content3">
<div class="content_bg">
<div class="mainbar3">
<div class="article">
<h2>Welcome to Web<br>
</h2>
<p>This web is categorized into several sub-webs which are accessible from the main page:</p>
<p><strong>Teams</strong></p>
<p>This sub web is accessible for everyone in o
The "web document" is displayed in the browser upon clicking the particular link in the search output.
In the search output (the list of links to the documents) I would like to include an excerpt of each document with the search word highlighted. Note that each document can contain one or more occurrences of the word, and the search is supposed to be case insensitive. The search output contains only links to documents where the word occurs at least once.
Here is where I have got so far:
select 'Documents',
'open_document.aspx?id=' || doc_id,
regexp_replace(regexp_replace(web_code, '<(.|\n)*?>', ''), '(word)', '<b><span style="background-color: #ffff00;">\1</span></b>\2', 1, 0, 'i') web_code,
doc_name,
regexp_count(regexp_replace(web_code, '<(.|\n)*?>', ''), 'word', 1, 'i') occurences
from web_knowledge_base
where lower(web_code) like '%word%';
This displays the links to documents which contain the search word, number of occurrences that is later used to order the list of links, and it also displays the html documents with the search word highlighted (that is the regexp_replace with style part).
Any way to limit what is displayed in the html document (the regexp_replace part)
a) To display only the sentences that contain the search word for each occurrence
b) To display 10 characters before and after each occurrence of the search word
while still having the search word highlighted and search case insensitive?
I'd like to do this as part of the select statement if possible.
Thanks a lot!
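Requirement (b) can be sketched as follows. This is a minimal illustration in Python rather than a single Oracle SELECT, assuming the excerpt could be built in the application layer once the row has been fetched. The function name `excerpts` and the plain `<b>` highlight are hypothetical simplifications of the styling used in the query above.

```python
import re
from typing import List

TAG_PATTERN = re.compile(r"<[^>]+>")

def excerpts(html: str, word: str, context: int = 10) -> List[str]:
    """Return one highlighted snippet per case-insensitive match of word."""
    text = TAG_PATTERN.sub("", html)  # strip the tags first
    snippets = []
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        # Take up to `context` characters on each side, clamped to the text.
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        before = text[start:match.start()]
        after = text[match.end():end]
        snippets.append(f"{before}<b>{match.group(0)}</b>{after}")
    return snippets

result = excerpts("<p>Our Web portal links every web page.</p>", "web")
print(result)
# ['Our <b>Web</b> portal li', 'nks every <b>web</b> page.']
```

Each match keeps its original capitalisation because the snippet reuses the matched text rather than the search term.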

Need to replace selected html tags from string in oracle sql

I need to remove a few selected HTML tags from a string.
For example, given the string:
<B><SMALL>DSD-DNPH Color Cap Insert</SMALL></B>
I did it in the following way:
REGEXP_REPLACE(name_text, '<[SMALL>]+>|<[/SMALL>]+>', '')
But I want to remove all of the selected HTML tags, such as <B>, <SMALL> and <FONT>.
Can you please suggest how to do this in a single expression for multiple tags?

You can use the following construct to get rid of tag-like constructs in a string (note that it removes every tag, not only the selected ones):
regexp_replace(name_text, '<.*?>')
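To remove only the selected tags, the tag names can be listed as alternatives inside the pattern. The following is a minimal sketch of that idea in Python, whose `re` module stands in for Oracle's regex engine here; the function name `strip_selected_tags` is hypothetical.

```python
import re

def strip_selected_tags(text, tag_names):
    """Remove opening and closing tags whose name is in tag_names."""
    names = "|".join(re.escape(name) for name in tag_names)
    # </?            "<" with an optional "/" for closing tags
    # (?:names)      one of the listed tag names
    # (?:\s[^>]*)?   optional attributes, which must start with whitespace
    # >              end of the tag
    pattern = re.compile(rf"</?(?:{names})(?:\s[^>]*)?>", re.IGNORECASE)
    return pattern.sub("", text)

selected = ["B", "SMALL", "FONT"]
print(strip_selected_tags("<B><SMALL>DSD-DNPH Color Cap Insert</SMALL></B>", selected))
# DSD-DNPH Color Cap Insert
```

Requiring whitespace before any attributes keeps `<B>` from also matching `<BR>`. An untested Oracle equivalent would presumably be REGEXP_REPLACE(name_text, '</?(B|SMALL|FONT)(\s[^>]*)?>', '', 1, 0, 'i').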

Need help for complicated sql update

I have a table with many records. I store HTML data in a field called Data in that table. The HTML data in each record contains many IMG tags like <img src='test.gif' />. A sample page is here: http://www.bba-reman.com/content.aspx?content=bba_reman_diagnostics_tools
That page shows many product images, and all of the data comes from the table. I want to use the lazyload jQuery plugin, and for that each IMG tag should look like <img src="img/grey.gif" data-original="img/example.jpg" >. So I need to update the HTML data in my table.
I need to write a SQL UPDATE statement that iterates over the HTML data in all rows, finds the IMG tags inside a particular div (identified by ID), sets the src of every IMG tag to the fixed value src="img/grey.gif", and adds one attribute to every IMG tag, like data-original="img/example.jpg".
I know my situation is a bit horrible for an UPDATE statement. Please suggest a good way to update all IMG tags using SQL. Thanks.

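Assuming that all your tags do actually end in /> and that no other / appears earlier in the value (CHARINDEX returns the position of the first one), then this would work: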
UPDATE myTable
SET tag = LEFT(tag, CHARINDEX('/', tag) - 1) + 'data-original=''example.gif'' />'
However, that wouldn't change the quotes, as you have done in your question, and it wouldn't remove the closing slash before the tag end, as you have done in your question.
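If the rewrite can be done outside SQL, for example in a one-off script that reads each row, transforms the HTML and writes it back, the transformation itself might look like the sketch below. It is a minimal Python illustration under stated assumptions: every IMG tag has src as its only attribute, the original src becomes the data-original value, and the restriction to a particular div is not handled. The names `to_lazyload` and `IMG_PATTERN` are hypothetical.

```python
import re

# Captures the src value, quoted with either ' or ", of an <img> tag
# whose only attribute is src (an assumption of this sketch).
IMG_PATTERN = re.compile(
    r"""<img\s+src=(['"])(?P<src>.*?)\1\s*/?>""",
    re.IGNORECASE,
)

def to_lazyload(html, placeholder="img/grey.gif"):
    """Point src at a placeholder and move the real URL to data-original."""
    def rewrite(match):
        original = match.group("src")
        return f'<img src="{placeholder}" data-original="{original}" >'
    return IMG_PATTERN.sub(rewrite, html)

print(to_lazyload("<div><img src='test.gif' /></div>"))
# <div><img src="img/grey.gif" data-original="test.gif" ></div>
```

For tags with more attributes, or to limit the change to one div, an HTML parser would likely be a safer tool than a regular expression.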
