PostgreSQL - Replace HTML Entities - sql

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like;
UPDATE nodes SET name=regexp_replace(name, 'à', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, 'á', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, 'â', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...
Shall I just press on with the queries? There will probably only be 40 or so of them.

Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities
Of course you need to have perl installed and pl/perl available.
1)
First of all create the procedural language pl/perlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&.... asasdasdasd …');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)

You could use xpath (HTML-encoded content is the same as XML encoded content):
select
'AT&T' as input ,
(xpath('/z/text()', ('<z>' || 'AT&T' || '</z>')::xml))[1] as output

This is what it took for me to get working on Ubuntu 18.04 with PG10, and Perl didn't decode some entities like &comma; for some reason. So I used Python3.
From the command line
sudo apt install postgresql-plpython3-10
From your SQL interface:
CREATE LANGUAGE plpython3u;
CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
from html.parser import HTMLParser
h = HTMLParser()
if str is None:
return str
return h.unescape(str);
$$ LANGUAGE plpython3u;

Related

DB2 zOS: XMLQUERY with long namespace and narrow editor

This sounds like a stuipd questions nowadays. Unfortunately some of use still have to cope with technology from last millennial.
How can I use XMLQUERY with declare namespace and a namespace like urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100 with an editor that is only 70 characters wide?
Basically I would like to run:
SELECT
xmlcast(
XMLQUERY('declare namespace ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100";
$e//ram:GrandTotalAmount'
PASSING XMLPARSE(DOCUMENT xmlcol) AS "e"
) AS integer)
FROM
mytable
But the namespace declaration is too long for the editor which only is 70 characters wide.
So far I found no way to break the declaration into multiple lines using ' || <newline> ' but any concat results in SQL Error [42601]: ILLEGAL USE OF KEYWORD PASSING
This depends on the program you use to execute these statements. With the standard DSNTEP2/SPUFI you just code up to column 72 and continue in column 1 like this (note that the column-numbering line is not part of the file, it's just the one displayed by using COLS):
//SYSTSIN DD *
DSN SYSTEM(DB2T)
RUN PROGRAM(DSNTEP2) -
PLAN (DSNTEP2)
END
//*
//SYSIN DD *
----+----1----+----2----+----3----+----4----+----5----+----6----+----7--
SELECT * FROM SOMEDATA.PLAN_TABLE WHERE EXPLAIN_TIME BETWEEN '2019-12
-13-00.00.00.000000' AND '2019-12-15-00.00.00.000000'
FETCH FIRST 500 ROWS ONLY;
/*
I thought that concatenating the query-expression should have worked, but it seems like IBM doesn't allow expressions in this place.
I managed to break up my query-expression in certain places by changing to a new line within the quotes (like after a / in the path), but not in others. If you can#t find such places (by experimenting), you will have to resort to the "column 72 -> column 1" tactic above.
Thanks a lot for #data_henrik's comment. It was really that simple:
SELECT
xmlcast(
XMLQUERY('$e//*:GrandTotalAmount'
PASSING XMLPARSE(DOCUMENT xmlcol) AS "e"
) AS integer)
FROM
mytable
That great because there are half a dozen namespaces in the XML file I would actually have to declare to get all the other elements/attributes iI need.

How to decode the HTML characters in SQL query [duplicate]

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like;
UPDATE nodes SET name=regexp_replace(name, 'à', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, 'á', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, 'â', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...
Shall I just press on with the queries? There will probably only be 40 or so of them.
Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities
Of course you need to have perl installed and pl/perl available.
1)
First of all create the procedural language pl/perlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&.... asasdasdasd …');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)
You could use xpath (HTML-encoded content is the same as XML encoded content):
select
'AT&T' as input ,
(xpath('/z/text()', ('<z>' || 'AT&T' || '</z>')::xml))[1] as output
This is what it took for me to get working on Ubuntu 18.04 with PG10, and Perl didn't decode some entities like &comma; for some reason. So I used Python3.
From the command line
sudo apt install postgresql-plpython3-10
From your SQL interface:
CREATE LANGUAGE plpython3u;
CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
from html.parser import HTMLParser
h = HTMLParser()
if str is None:
return str
return h.unescape(str);
$$ LANGUAGE plpython3u;

Postgres PL/pgSQL Function results to file, with filename as argument

I am migrating some client side stuff into the server, and want to put it into a function.
I need to get the results of a query into a CSV file. But, I'd like to pass the file name/location of the resulting file as an argument of the function.
So, this is a simple example of what I want to do:
CREATE FUNCTION send_email_results(filename1 varchar) RETURNS void AS $$
DECLARE
BEGIN
COPY(SELECT * FROM mytable) TO filename1 WITH CSV;
END;
$$ LANGUAGE plpgsql;
Postgres is complaining about this though, as it is translating the filename1 argument to '$1', and it doesn't know what to do.
I can hardcode the path if need be, but being able to pass it as a parameter sure would be handy.
Anyone have any clues?
I just ran in to this. It turns out that you can't use parameterized arguments when the copy command is used (at least that's the case with python as the stored proc language). So, you have to build the command without arguments, like:
CREATE FUNCTION send_email_results(filename1 varchar) RETURNS void AS $$
DECLARE
BEGIN
execute 'copy (select * frommytable) to ' || filename1 || ' with csv;';
END;
$$ LANGUAGE plpgsql;
You might have to use the quoting feature to make it a little more readable. I don't know, I don't use plpgsql as a postgres function language, so the syntax might be wrong.
execute 'copy (select * frommytable) to ' || quote_literal(filename1) || ' with csv;'

How do you escape this regular expression?

I'm looking for
"House M.D." (2004)
with anything after it. I've tried where id~'"House M\.D\." \(2004\).*'; and there's no matches
This works id~'.*House M.D..*2004.*'; but is a little slow.
I suspect you're on an older PostgreSQL version that interprets strings in a non standards-compliant C-escape-like mode by default, so the backslashes are being treated as escapes and consumed. Try SET standard_conforming_strings = 'on';.
As per the lexical structure documentation on string constants, you can either:
Ensure that standard_conforming_strings is on, in which case you must double any single quotes (ie ' becomes '') but backslashes aren't treated as escapes:
id ~ '"House M\.D\." \(2004\)'
Use the non-standard, PostgreSQL-specific E'' syntax and double your backslashes:
id ~ E'"House M\\.D\\." \\(2004\\)'
PostgreSQL versions 9.1 and above set standard_conforming_strings to on by default; see the documentation.
You should turn it on in older versions after testing your code, because it'll make updating later much easier. You can turn it on globally in postgresql.conf, on a per-user level with ALTER ROLE ... SET, on a per-database level with ALTER DATABASE ... SET or on a session level with SET standard_conforming_strings = on. Use SET LOCAL to set it within a transaction scope.
Looks that your regexp is ok
http://sqlfiddle.com/#!12/d41d8/113
CREATE OR REPLACE FUNCTION public.regexp_quote(IN TEXT)
RETURNS TEXT
LANGUAGE plpgsql
STABLE
AS $$
/*******************************************************************************
* Function Name: regexp_quote
* In-coming Param:
* The string to decoded and convert into a set of text arrays.
* Returns:
* This function produces a TEXT that can be used as a regular expression
* pattern that would match the input as if it were a literal pattern.
* Description:
* Takes in a TEXT in and escapes all of the necessary characters so that
* the output can be used as a regular expression to match the input as if
* it were a literal pattern.
******************************************************************************/
BEGIN
RETURN REGEXP_REPLACE($1, '([[\\](){}.+*^$|\\\\?-])', '\\\\\\1', 'g');
END;
$$
Test:
SELECT regexp_quote('"House M.D." (2004)'); -- produces: "House M\\.D\\." \\(2004\\)

Learning jython string manipulation

I'm learning jython, and I want to see how to replace the suffix of a string.
For example, I have string:
com.foo.ear
and I want to replace the suffix to get:
com.foo.war
I cannot get replace or re.sub to work
You mention re.sub; here's one way to use that:
import re
re.sub('.ear$','.war','com.foo.ear')
# -> 'com.foo.war'
The $ matches the end of the string.
Using replace would be even simpler:
'com.foo.ear'.replace('ear','war')
# -> 'com.foo.war'
Edit:
And since that looks like a path, you may want to look into using os.path.splitext:
'{0}{1}'.format(os.path.splitext('com.foo.ear')[0],'.war')
# -> 'com.foo.war'