Removing HTML Tags in BigQuery - sql

I have a column in my table which stores a paragraph like the one below:
<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" alt="" width="250" height="33"></p><p><span style="font-size: 16pt; font-family: Mali, cursive; font-weight: 500;">My beautiful text is here. Show me without tags, please.</span> </p>
I want to remove all the HTML tags and, if possible, replace each HTML image with the text (Image).
So my expected output will be like below :
(Image) My beautiful text is here. Show me without tags, please.
OR just
My beautiful text is here. Show me without tags, please.
Thank you so much.

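Try the naive approach below:
select html,
  regexp_replace(
    regexp_replace(
      regexp_replace(html,
        r'<img [^<>]*>', r'(Image) '),  -- step 1: image tag -> (Image)
      r'(&)([^&;]*)(;)', r'<\2>'        -- step 2: &entity; -> <entity>
    ), r'\<[^<>]*\>', ''                -- step 3: drop everything in <...>
  ) as text
from your_table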
Applied to the sample data in your question, the output is:
(Image) My beautiful text is here. Show me without tags, please.
As you can see, the first step replaces the image tag with the text (Image); the second step handles HTML entities by rewriting them as <...> (for example, &nbsp; becomes <nbsp>); and the final step removes everything between and including < and >.
Note: this is a simplistic approach and might not work for more complex HTML.
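For readers who want to try the three steps outside BigQuery, here is a minimal Python sketch of the same pipeline. It is an illustration only, not BigQuery code; the function name strip_html and the final strip() call are assumptions added for the example.

```python
import re

def strip_html(html: str) -> str:
    """Naive three-step tag stripper mirroring the nested regexp_replace calls."""
    # Step 1: replace every <img ...> tag with the literal text "(Image) ".
    text = re.sub(r"<img [^<>]*>", "(Image) ", html)
    # Step 2: rewrite HTML entities such as &nbsp; as pseudo-tags (<nbsp>)
    # so that step 3 removes them together with the real tags.
    text = re.sub(r"&([^&;]*);", r"<\1>", text)
    # Step 3: remove everything between and including < and >.
    text = re.sub(r"<[^<>]*>", "", text)
    # Assumption for this sketch: trim leftover leading/trailing whitespace.
    return text.strip()

sample = (
    '<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" '
    'alt="" width="250" height="33"></p>'
    '<p><span style="font-size: 16pt; font-family: Mali, cursive; font-weight: 500;">'
    'My beautiful text is here. Show me without tags, please.</span> </p>'
)

print(strip_html(sample))
```

Like the SQL version, this treats any < ... > run as a tag, so text that contains a literal < or > would be damaged.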

Related

How to do a full text search, when content has HTML tags?

I want to make a full-text search using text content in HTML format. E.g.:
... f<em>oo</em> ...
If the search term is 'foo', the document containing "f<em>oo</em>" will not be found.
How to make this work?
I'm using PostgreSQL
I'm afraid you won't find anything out of the box, but you could extract the text from your html column, e.g. using regular expressions or xpath (designed for XML):
CREATE TABLE t (html text);
INSERT INTO t VALUES ('<html>
<h1>
<foo>test f<em>oo</em> bar</foo>
</h1>
<h1>
<foo>test bar</foo>
</h1>
</html>');
SELECT * FROM t
WHERE to_tsvector(regexp_replace(html,'<.+?>','','g')) @@ plainto_tsquery('test foo');
html
--------------------------------------
<html> +
<h1> +
<foo>test f<em>oo</em> bar</foo>+
</h1> +
<h1> +
<foo>test bar</foo> +
</h1> +
</html>
Keep in mind that building the tsvector at query time will make things quite slow. If you decide to go this way, consider storing it in a new column and creating a GIN index on that column, e.g.
CREATE INDEX idx_html_tsvector ON t USING gin(the_new_column);
The PostgreSQL text parser recognizes tags but treats them as word boundaries, so it will parse that as 'f' and 'oo'. This cannot be changed without hacking the C code. You could implement your own parser, but again in C, and doing so is not easy. Your best bet is probably to pre-process your text with something else to remove the tags, making sure that this closes up the gap to give 'foo', not 'f oo'.
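The difference between closing the gap and leaving one can be seen in a small Python sketch. Python is used purely for illustration; strip_tags is a hypothetical helper, not a PostgreSQL function. It applies the same non-greedy pattern as the regexp_replace call, once with an empty replacement and once with a space.

```python
import re

# Same non-greedy pattern as in the query; DOTALL lets a tag span lines.
TAG = re.compile(r"<.+?>", re.DOTALL)

def strip_tags(html: str, gap: str = "") -> str:
    """Remove tags, putting `gap` where each tag used to be."""
    return TAG.sub(gap, html)

doc = "<foo>test f<em>oo</em> bar</foo>"

closed = strip_tags(doc)         # tags vanish, so 'f' and 'oo' join up
spaced = strip_tags(doc, " ")    # tags become spaces, so the word stays split

print(closed.split())   # ['test', 'foo', 'bar']
print(spaced.split())   # ['test', 'f', 'oo', 'bar']
```

Only the first form yields the token 'foo', which is what a search for 'foo' needs to match.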

How to read a text when it is not in any HTML tag

How can I find text in following HTML:
style="background-color: transparent;">
<a href="/">Home</a>
<a id="brea dcrumbs-790" class=" main active mainactive" href="/products">Products</a>
<a href="/products/fruit-and-creme-curds">Fruit & Crème Curds</a>
Crème Banana Curd
</li>
</ul>"
</div>
This is the HTML for a breadcrumb: the first three items are links and the fourth is the page name. I want to read the page name (Crème Banana Curd) from the breadcrumb, but since it is not inside any node of its own, how can I catch it?
If the text isn't inside any tag of its own, it is a text node of the enclosing element (at the top level, the body tag).
So you can use something like the expression below to identify it:
html/body/text()
The question is somewhat vague without the proper HTML source, but you may still try the solution below, storing the text in a variable:
var breadcrumb = FindElement(By.XPath(".//*[@id='brea dcrumbs-790']/following-sibling::a")).Text;
Use the code below:
WebElement elem = driver.findElement(By.xpath("//*[contains(text(),'Crème Banana Curd')]"));
elem.getText();
hope this will help you.
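To make the "text outside any tag" idea concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree rather than Selenium. The markup is a hypothetical, well-formed reconstruction of the breadcrumb (the original source is truncated), and trailing_text is a name made up for this example. In ElementTree, text that follows an element inside its parent is exposed as that element's tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reconstruction of the breadcrumb markup.
BREADCRUMB = """
<ul>
  <li>
    <a href="/">Home</a>
    <a href="/products">Products</a>
    <a href="/products/fruit-and-creme-curds">Fruit &amp; Crème Curds</a>
    Crème Banana Curd
  </li>
</ul>
"""

def trailing_text(markup: str) -> str:
    """Return the bare text that follows the last link inside the <li>."""
    root = ET.fromstring(markup.strip())
    li = root.find("li")
    last_link = li.findall("a")[-1]
    # The page name is not inside any child element: it is the text node
    # after the last <a>, which ElementTree stores as that element's tail.
    return (last_link.tail or "").strip()

print(trailing_text(BREADCRUMB))
```

In a browser-driven test the equivalent is usually to read the parent element's text and remove the link texts, since WebDriver locators return elements rather than text nodes.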

Need to keep <br> in text block tags while using import.io

Looking to do something relatively straightforward, I'm scraping text which so far I have had no problem grabbing, but I need to keep the <br> tags because white space analysis is an important part of the dataset.
Is there a way to keep the <br> tags so I can turn them into \n\r later on?
Example:
<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>
I need : Some text.<br>Some more text.<br>Some more more text.<br>
Right now I get: Some text. Some more text. Some more more text.
Advice?
The only way is to get the HTML of your selection: all you have to do is change the column type from Text to HTML. There is no way to get only the text plus the <br> tags.
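Once the column is exported as HTML, the other tags can be removed afterwards. The following is a rough Python sketch of that post-processing step, not an import.io feature; keep_only_br and br_to_newlines are hypothetical helper names, and the patterns assume the <br> tags carry no attributes.

```python
import re

SAMPLE = """<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>"""

def keep_only_br(html: str) -> str:
    """Strip every tag except <br>, normalising <br/>, </br> etc. to <br>."""
    # Normalise all attribute-free br variants to a plain <br>.
    html = re.sub(r"<\s*/?\s*br\s*/?\s*>", "<br>", html, flags=re.IGNORECASE)
    # Remove every remaining tag that is not exactly <br>.
    html = re.sub(r"<(?!br>)[^>]*>", "", html)
    # Drop the line breaks of the markup itself; only <br> carries meaning.
    html = re.sub(r"\s*\n\s*", "", html)
    return html.strip()

def br_to_newlines(text: str) -> str:
    """Turn each <br> into a newline character."""
    return text.replace("<br>", "\n")

print(keep_only_br(SAMPLE))
```

Calling br_to_newlines on the result then gives one line of text per <br>.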

dijit.InlineEditBox with highlighted html

I have some dijit.InlineEditBox widgets and now I need to add some search highlighting over them, so I return the results with a span with class="highlight" over the matched words. The resulting code looks like this :
<div id="title_514141" data-dojo-type="dijit.InlineEditBox"
data-dojo-props="editor:\'dijit.form.TextBox\', onFocus:titles.save_old_value,
onChange:titles.save_inline, renderAsHtml:true">Twenty Thousand Leagues <span
class="highlight">Under</span> the Sea</div>
This looks as expected, however, when I start editing the title the added span shows up. How can I make the editor remove the span added so only the text remains ?
In this particular case the titles of the books have no html in them, so some kind of full tag stripping should work, but it would be nice to find a solution (in case of short description field with a dijit.Editor widget perhaps) where the existing html is left in place and only the highlighting span is removed.
Also, if you can suggest a better way to do this (inline editing and word highlighting) please let me know.
Thank you !
How will this affect your displayed content in the editor? It rather depends on the contents you allow into the field - you will need a rich-text editor (huge footprint) to handle html correctly.
This regular expression strips all tags (opening and closing) from the displayed HTML; the g flag makes it replace every match, not just the first:
this.value = this.displayNode.innerHTML.replace(/<[^>]*>/g, '');
Here is the same idea in a complete example:
<div id="title_514141" data-dojo-type="dijit.InlineEditBox"
data-dojo-props="editor:\'dijit.form.TextBox\', onFocus:titles.save_old_value,
onChange:titles.save_inline, renderAsHtml:true">Twenty Thousand Leagues <span
class="highlight">Under</span> the Sea
<script type="dojo/method" event="onFocus">
// Strip every tag (including the highlight span) before editing starts.
this.value = this.displayNode.innerHTML.replace(/<[^>]*>/g, '');
// Then run the widget's original onFocus handling.
this.inherited(arguments);
</script>
</div>
The renderAsHtml attribute only trims off one layer, so embedded HTML will still be HTML, as far as I know. With the above you should be able to 1) override the onFocus handling, 2) set the editable value yourself, and 3) call the 'old' onFocus method.
Alternatively, seeing that you have already set 'titles.save_*' in the props, use dojo/connect instead of dojo/method, but you need to get there first, so to speak.
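For the case where existing HTML should stay in place and only the highlighting is removed, the pattern can be narrowed to the highlight span alone. Here is a minimal sketch in Python for illustration (the same regular expression should carry over to JavaScript); remove_highlight is a hypothetical name, and the sketch assumes the span is written exactly as class="highlight" and contains no nested span.

```python
import re

# Matches only the highlight wrapper and captures the text inside it.
HIGHLIGHT = re.compile(r'<span\s+class="highlight">(.*?)</span>', re.DOTALL)

def remove_highlight(html: str) -> str:
    """Unwrap highlight spans, leaving all other markup untouched."""
    return HIGHLIGHT.sub(r"\1", html)

title = 'Twenty Thousand Leagues <span class="highlight">Under</span> the <b>Sea</b>'
print(remove_highlight(title))
```

Because the <b> element is not matched by the pattern, it survives, unlike with the strip-everything approach.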

Need help for complicated sql update

I have a table with many records. One field, called Data, stores HTML, and the HTML in each record contains many IMG tags like <img src='test.gif' />. A sample page is here: http://www.bba-reman.com/content.aspx?content=bba_reman_diagnostics_tools
That page shows many product images, and all of its data comes from the table. I want to use the lazyload jQuery plugin, and for that each IMG tag should look like <img src="img/grey.gif" data-original="img/example.jpg" >. So I need to update the HTML stored in my table.
I need to write a SQL UPDATE statement that goes through the HTML in every row, finds the IMG tags inside a particular div (identified by its ID), sets src to the fixed value src="img/grey.gif" for all images, and adds one attribute to each IMG tag, like data-original="img/example.jpg".
I know my situation is a bit horrible for an UPDATE statement. Please suggest a good way to update all IMG tags using SQL. Thanks.
Assuming that all your tags do actually end in /> and that no / appears earlier in the value (CHARINDEX returns the position of the first /), then this would work:
UPDATE myTable
SET tag = LEFT(tag, CHARINDEX('/', tag) - 1) + 'data-original=''example.gif'' />'
However, that wouldn't change the quotes, as you have done in your question, and it wouldn't remove the closing slash before the tag end, as you have done in your question.
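If the rewrite can be done outside the database (read each row, transform the HTML, write it back), a regular expression with a replacement function handles both the src change and the new attribute. The following is only a Python sketch under that assumption; make_lazy and IMG_SRC are hypothetical names, and the sketch rewrites every img tag rather than only those inside a particular div.

```python
import re

PLACEHOLDER = "img/grey.gif"  # fixed placeholder src taken from the question

# Group 1: the tag up to and including src=, group 2: the quote character,
# group 3: the original URL. Assumes src is quoted and holds no such quote.
IMG_SRC = re.compile(r"""(<img\b[^>]*?\bsrc=)(["'])(.*?)\2""", re.IGNORECASE)

def make_lazy(html: str) -> str:
    """Move each image URL to data-original and point src at the placeholder."""
    def repl(match: re.Match) -> str:
        prefix, src = match.group(1), match.group(3)
        return f'{prefix}"{PLACEHOLDER}" data-original="{src}"'
    return IMG_SRC.sub(repl, html)

print(make_lazy("<p><img src='test.gif' /></p>"))
```

The function is not idempotent: running it twice would copy the placeholder into data-original, so it should be applied only to rows that have not been converted yet.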