Remove base64 value from string column in Postgres - sql

I have a text column in a table that contains HTML data along with images represented in base64 encoding.
Here is an example:
</p><p><span lang="EN"> </span></p><p>
</p><p><img width="263" height="135" align="right" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABgAA...." alt=""></p>
The string after base64 is really long. I want to remove the long string representation and replace it with the word "image".
I tried a pattern match on base64 and removed everything after it up to the " mark before the alt keyword. It worked in cases where there is only one occurrence of a base64 value. When there are multiple occurrences, it fails.
Is there a better way to approach this problem in order to remove just the string representing image in base64 encoding?

To have the actual replacement work more than just once, you need to use the "global" flag for the regexp_replace, e.g.:
=# SELECT regexp_replace(E'\n\n...height="135" align="right" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABgAA...." alt="" ...\n<p></p>\n<p><img align="left" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABgAA...." class="test" alt=""/>\n', '(data:[^,]+,)[^"]+', '\1<data>', 'g');
regexp_replace
------------------------------------------------------------------------------
+
+
...height="135" align="right" src="data:image/png;base64,<data>" alt="" ... +
<p></p> +
<p><img align="left" src="data:image/png;base64,<data>" class="test" alt=""/>+
(1 row)
...so: regexp_replace(my_html_column, '(data:[^,]+,)[^"]+', '\1<data>', 'g')
That should match and replace all data URIs in the given text.
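To experiment with the pattern outside the database, here is a minimal Python sketch of the same substitution. It uses Python's re module rather than Postgres, so it is only an analogue: re.sub replaces every match by default, which corresponds to the 'g' flag above. The function name strip_data_uris and the sample string are made up for illustration.

```python
import re

# Same pattern as in the regexp_replace call: keep the "data:...," prefix,
# drop everything up to (but not including) the closing double quote.
DATA_URI = re.compile(r'(data:[^,]+,)[^"]+')


def strip_data_uris(html: str) -> str:
    """Replace the payload of every data URI with the placeholder <data>."""
    return DATA_URI.sub(r'\1<data>', html)


# Hypothetical sample with two embedded images.
sample = (
    '<p><img align="right" src="data:image/png;base64,/9j/4AAQSkZJRg" alt=""></p>\n'
    '<p><img align="left" src="data:image/png;base64,/9j/4AAQSkZJRg" class="test" alt=""/></p>'
)
cleaned = strip_data_uris(sample)
print(cleaned)
```

Both occurrences are rewritten in one pass because the match stops at the first " after each payload.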

Maybe your problem is the greedy match, and the solution is to match anything but " characters (with the 'g' flag, so that every occurrence is replaced):
regexp_replace(col, 'base64,[^"]*', 'image', 'g')

How to tokenize html tags with spacy?

I need to tokenize HTML text with spaCy, or merge the tags after tokenization. They can be any HTML tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in the documentation, but it doesn't work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have also tried to write a custom tokenizer, but I had problems with spaces.
I want every HTML tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.
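The stream-tokenizer approach can be sketched in Python as follows. This is an illustration only: it uses the standard library's html.parser.HTMLParser instead of libxml2, and the class name TagBoundaryText and helper html_to_text are hypothetical. The start-tag handler emits a single space and the data handler passes characters through unchanged.

```python
from html.parser import HTMLParser


class TagBoundaryText(HTMLParser):
    """Collects character data and emits one space for every start tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat every opening tag as a word separator.
        self.parts.append(" ")

    def handle_data(self, data):
        # Pass the text through exactly as received.
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TagBoundaryText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


text = html_to_text("<span>word</span><span>word</span>")
words = text.split()  # splitting on whitespace absorbs repeated spaces
print(repr(text), words)
```

Because only start tags emit a space, nested tags produce runs of spaces, which str.split() collapses when the cleaned-up text is split into words.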

How do I eliminate the spaces?

I would like to collect Japanese articles found through Google. To extract the Japanese sentences, I run the following code on the tag that contains the most Japanese words.
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
    text += s
But when I run this code, the extracted sentence has leading spaces.
For example, if the HTML is as below and the path is '//p',
<p class dir='sample'>
<span>
<a role='button' tabindex='0' style='white-space: normal;'>A
B</a>
<span> </span>
</span>
</p>
I got the sentences as below.
A
B
I tried to eliminate these spaces with text.strip(), but the spaces remained.
How do I get 'AB' from this HTML? Or how do I eliminate the spaces? I would appreciate it if anyone could tell me how to get 'AB'.
This can be done with a regular expression:
>>> import re
>>> re.sub(r'\n\s+', '', s)
'AB'
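A self-contained version of that idea, as a sketch: the text nodes are joined first, then each newline together with the indentation that follows it is removed, and a final strip() catches anything left at the edges. The function name join_text_nodes and the sample list are hypothetical stand-ins for what extract() might return.

```python
import re


def join_text_nodes(texts):
    """Join extracted text nodes and drop newlines plus following indentation."""
    joined = "".join(texts)
    # "\n\s+" matches a newline and all whitespace after it, including
    # further newlines, so blank indented lines disappear as well.
    return re.sub(r"\n\s+", "", joined).strip()


# Hypothetical text nodes, similar to what the markup above could yield.
texts = ["\n  ", "\n    ", "A\n    B", "\n    ", " ", "\n  ", "\n"]
text = join_text_nodes(texts)
print(text)
```

Note that this also removes line breaks inside the text itself, which suits Japanese text without inter-word spaces but may not suit other languages.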

How to not escape special chars when updating XML in oracle SQL

I have a problem trying to update XMLType values in Oracle.
I need to modify XML that looks similar to the following:
<a>
<b>Something to change here</b>
<c>Here is some narrative containing weirdly escaped <tags>\</tags> </c>
</a>
What I want to achieve is to modify <b/> without modifying <c/>.
Unfortunately, the following updateXml call:
select
updatexml(XML_TO_MODIFY, '/a/b/text()', 'NewValue')
from dual;
returns this:
<a>
<b>NewValue</b>
<c>Here is some narrative containing weirdly escaped <tags></tags> </c>
</a>
As you can see, the '>' has been escaped (to &gt;).
The same happens with xmlQuery (the newer, non-deprecated replacement for updateXml):
select /*+ no_xml_query_rewrite */
xmlquery(
'copy $d := .
modify (
for $i in $d/a
return replace value of node $i/b with ''nana''
)
return $d'
passing t.xml_data
returning content
) as updated_doc
from (select xmlType('<a>
<b>Something to change here</b>
<c>Here is some narrative containing weirdly escaped \<tags>\</tags> </c>
</a>') as xml_data from dual) t
;
I also get the same result when using xmlTransform.
I tried to use the
disable-output-escaping="yes"
But it did the opposite: it unescaped the &lt; into a literal <:
select XMLTransform(
xmlType('<a>
<b>Something to change here</b>
<c>Here is some narrative containing weirdly escaped \<tags>\</tags> </c>
</a>'),
XMLType(
'<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/a/b">
<b>
<xsl:value-of select="text()"/>
</b>
</xsl:template>
<xsl:template match="/a/c">
<c>
<xsl:value-of select="text()" disable-output-escaping="yes"/>
</c>
</xsl:template>
</xsl:stylesheet>'))
from dual;
returned:
<a>
<b>NewValue</b>
<c>Here is some narrative containing weirdly escaped <tags></tags> </c>
</a>
Any suggestions?
Two things you need to know:
I cannot modify the initial format: it comes to me this way and I need to preserve it.
The original message is so big that converting it to a string and back (to use regexps as a workaround) will not do the trick.
The root of your issue seems to be that the original value of node c is not consistently escaped: it contains a raw > within the value instead of &gt;, and it is not inside a CDATA section (see also: What does <![CDATA[]]> in XML mean?).
The string value of:
Here is some narrative containing weirdly escaped <tags>\</tags>
in XML format should really be
<c>Here is some narrative containing weirdly escaped &lt;tags>\&lt;/tags></c>
OR
<c><![CDATA[Here is some narrative containing weirdly escaped <tags>\</tags>]]></c>
I would either request that the XML be corrected at the source, or implement some method to sanitize the inputs yourself, such as wrapping the <c> node values in <![CDATA[]]>. If you need to save the exact original value, and the messages are large, then the best I can think of is to store duplicate copies: the original value as a string, and the "sanitized" value as the XML data type.
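To check that the two forms above carry the same string value, here is a small Python sketch using the standard xml.etree.ElementTree parser. This is an assumption made for illustration; Oracle's XMLType is a different implementation and may behave differently. The sketch also shows that re-serializing the parsed node escapes the >, which resembles the behavior described in the question.

```python
import xml.etree.ElementTree as ET

# The two representations suggested above (raw strings keep the backslash).
escaped = r"<c>Here is some narrative containing weirdly escaped &lt;tags>\&lt;/tags></c>"
cdata = r"<c><![CDATA[Here is some narrative containing weirdly escaped <tags>\</tags>]]></c>"

# Both parse to the same text content.
escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text
print(escaped_text == cdata_text)

# Serializing the parsed element again normalizes the escaping:
# ElementTree writes both < and > in text as entities.
reserialized = ET.tostring(ET.fromstring(escaped), encoding="unicode")
print(reserialized)
```

The parser keeps only the string value, not the original spelling of the escapes, so the input form cannot be recovered byte for byte after a parse and serialize round trip.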
In the end we managed to do this with the help of Java, by:
reading the XML as a CLOB
modifying it in Java
storing it back in the database using java.sql.Connection (for some reason, when we used JdbcTemplate, it complained about casting to Long, which indicated that the string was over 4000 bytes (talking about clean errors, all hail Oracle), and using the CLOB type didn't really help; I guess that's a different story, though)
When storing the data, Oracle does not perform any magic; only updates tend to modify escape characters.
Possibly not an answer for everyone, but a nice workaround if you stumble upon the same problem as we did.

How to handle new line in handlebar.js

I am using Handlebars.js in my Rails jQuery Mobile application.
I have a JSON value data = "hi\n\n\n\n\nb\n\n\n\nhow r u"
which, when used in the .hbs file as {{data}}, is displayed as hi how r u rather than with the actual line breaks.
Any suggestions?
A <pre> tag helped me.
Handlebars doesn't mess with newlines in your data unless you have registered a helper which is doing something with them. A good way of dealing with newlines in HTML without converting them to br tags would be to use the CSS property white-space while rendering the handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set the value as pre-line
For example:
<p style="white-space: pre-line">
{{text}}
</p>

Keep text formatting in SQL

I have a text area that inserts its content into a SQL table. Is there a way to keep the formatting of the text and then use it in HTML?
I'll assume you're talking about preserving line breaks.
Either:
Output the text inside a <pre> tag
or
Convert newlines to <br /> tags before insertion to the DB. (E.g. nl2br in PHP).
If you mean keeping the line breaks (Enter key presses), replace char 10 and char 13 with <br/>.
In SQL (note the literal line breaks inside the strings):
select replace('
test
test','
','<br/>')
This results in <br/>test<br/>test
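The same conversion can be done in application code before rendering. Below is a minimal Python sketch; the function name nl2br is borrowed from the PHP function mentioned above and is not a Python built-in. Escaping the text with html.escape first is an added assumption, made so that user input cannot inject markup; drop it if the stored text is meant to contain HTML.

```python
import html


def nl2br(text: str) -> str:
    """Escape text for HTML and turn line breaks into <br/> tags."""
    escaped = html.escape(text)  # assumption: the stored text is plain text
    # Normalize CR LF (char 13 + char 10) and lone CR to LF first.
    normalized = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "<br/>")


result = nl2br("\ntest\ntest")
print(result)
```

Converting at output time rather than before insertion keeps the stored text unchanged, so it can still be shown in a text area for editing.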
Text is text is text. Insert the text into the table including its markup and it will come out that way as well.
...or am I misunderstanding your question?