How to decode the HTML characters in SQL query [duplicate] - sql

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like:
UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...
Shall I just press on with the queries? There will probably only be 40 or so of them.
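The regex idea from the question can be tried outside the database first. This is a minimal Python sketch, assuming the text only needs hexadecimal numeric references of the form &#x..; decoded. The name decode_hex_entities is made up for illustration.

```python
import re

# Matches hexadecimal numeric character references such as &#xe0;
HEX_ENTITY = re.compile(r"&#x([0-9a-fA-F]+);")

def decode_hex_entities(text):
    """Replace each &#x..; reference with the character it encodes."""
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_hex_entities("voil&#xe0; caf&#xe9;"))
```

One pattern covers every hexadecimal reference, so no per-character query is needed. Named entities such as &amp; are not handled by this sketch.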

Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities
Of course you need to have Perl installed and pl/perl available.
1) First of all create the procedural language pl/perlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&.... asasdasdasd …');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)

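You could use xpath (XML decodes numeric character references and the five entities it predefines, such as &amp;; other HTML named entities like &eacute; are not valid XML):
select
'AT&amp;T' as input ,
(xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output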
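Whether XML decoding covers a given entity can be checked quickly outside the database. The following is a small Python sketch using the standard library XML parser. The <z> wrapper mirrors the query above, and the helper name xml_decode is hypothetical.

```python
import xml.etree.ElementTree as ET

def xml_decode(fragment):
    """Wrap the fragment in a dummy element and let the XML parser decode it."""
    return ET.fromstring("<z>" + fragment + "</z>").text

print(xml_decode("AT&amp;T"))    # entity predefined by XML
print(xml_decode("voil&#xe0;"))  # numeric character reference

try:
    xml_decode("caf&eacute;")    # named entity defined by HTML only
except ET.ParseError as exc:
    print("not valid XML:", exc)
```

If the crawled data contains HTML-only named entities, an XML-based approach will probably reject those rows rather than decode them.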

This is what it took to get this working on Ubuntu 18.04 with PG10. Perl didn't decode some entities, such as &comma;, for some reason, so I used Python 3.
From the command line:
sudo apt install postgresql-plpython3-10
From your SQL interface:
CREATE LANGUAGE plpython3u;
CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
# html.unescape is available from Python 3.4; HTMLParser.unescape was removed in Python 3.9
import html
if str is None:
    return None
return html.unescape(str)
$$ LANGUAGE plpython3u;
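The decoding itself can be tried in a plain Python 3 session before wiring it into PostgreSQL. This is a small sketch, assuming Python 3.4 or later, where html.unescape is part of the standard library. The sample strings are made up.

```python
import html

# A named entity, a hexadecimal numeric reference, and the HTML5-only &comma;
samples = ["AT&amp;T", "voil&#xe0;", "a&comma;b"]
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

html.unescape works from the HTML5 entity table, which is why it also handles names like &comma;.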

Related

DB2 zOS: XMLQUERY with long namespace and narrow editor

This sounds like a stupid question nowadays. Unfortunately some of us still have to cope with technology from the last millennium.
How can I use XMLQUERY with declare namespace and a namespace like urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100 with an editor that is only 70 characters wide?
Basically I would like to run:
SELECT
xmlcast(
XMLQUERY('declare namespace ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100";
$e//ram:GrandTotalAmount'
PASSING XMLPARSE(DOCUMENT xmlcol) AS "e"
) AS integer)
FROM
mytable
But the namespace declaration is too long for the editor, which is only 70 characters wide.
So far I have found no way to break the declaration across multiple lines: splitting it with ' || <newline> ' (or any other concat) results in SQL Error [42601]: ILLEGAL USE OF KEYWORD PASSING
This depends on the program you use to execute these statements. With the standard DSNTEP2/SPUFI you just code up to column 72 and continue in column 1 like this (note that the column-numbering line is not part of the file, it's just the one displayed by using COLS):
//SYSTSIN DD *
DSN SYSTEM(DB2T)
RUN PROGRAM(DSNTEP2) -
PLAN (DSNTEP2)
END
//*
//SYSIN DD *
----+----1----+----2----+----3----+----4----+----5----+----6----+----7--
SELECT * FROM SOMEDATA.PLAN_TABLE WHERE EXPLAIN_TIME BETWEEN '2019-12
-13-00.00.00.000000' AND '2019-12-15-00.00.00.000000'
FETCH FIRST 500 ROWS ONLY;
/*
I thought that concatenating the query-expression should have worked, but it seems like IBM doesn't allow expressions in this place.
I managed to break up my query-expression in certain places by changing to a new line within the quotes (like after a / in the path), but not in others. If you can't find such places (by experimenting), you will have to resort to the "column 72 -> column 1" tactic above.
Thanks a lot for @data_henrik's comment. It was really that simple:
SELECT
xmlcast(
XMLQUERY('$e//*:GrandTotalAmount'
PASSING XMLPARSE(DOCUMENT xmlcol) AS "e"
) AS integer)
FROM
mytable
That's great, because there are half a dozen namespaces in the XML file that I would otherwise have to declare to get all the other elements/attributes I need.

howto cut text from specific character in sqlite query

SQLITE Query question:
I have a query which returns string with the character '#' in it.
I would like to remove all characters after this specific character '#':
select field from mytable;
result :
text#othertext
text2#othertext
text3#othertext
So in my sample I would like to create a query which only returns :
text
text2
text3
I tried something with instr() to get the index, but instr() was not recognized as a function -> SQL Error: no such function: instr (probably an old version of the db; sqlite_version() -> 3.7.5).
Any hints on how to achieve this?
There are two approaches:
1) You can rtrim the string of all characters other than the # character. This assumes, of course, that (a) there is only one # in the string; and (b) that you're dealing with simple strings (e.g. 7-bit ASCII) in which it is easy to list all the characters to be stripped.
2) You can use sqlite3_create_function to create your own rendition of INSTR. The specifics here will vary a bit upon how you're using
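Both approaches can be sketched with Python's sqlite3 module. This assumes an in-memory table shaped like the one in the question and that the part after # contains only ASCII letters and digits. The function name my_instr is hypothetical, and conn.create_function is the Python counterpart of sqlite3_create_function.

```python
import sqlite3
import string

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field TEXT)")
conn.executemany(
    "INSERT INTO mytable (field) VALUES (?)",
    [("text#othertext",), ("text2#othertext",), ("text3#othertext",)],
)

# Approach 1: strip trailing letters/digits back to the '#', then the '#' itself.
# Only safe when the part after '#' consists solely of the listed characters.
strip_chars = string.ascii_letters + string.digits
via_rtrim = [
    row[0]
    for row in conn.execute(
        "SELECT rtrim(rtrim(field, ?), '#') FROM mytable ORDER BY rowid",
        (strip_chars,),
    )
]

# Approach 2: register a user-defined function (hypothetical name my_instr).
def my_instr(haystack, needle):
    """1-based position of needle in haystack, 0 if absent (like INSTR)."""
    if haystack is None or needle is None:
        return None
    return haystack.find(needle) + 1

conn.create_function("my_instr", 2, my_instr)
via_instr = [
    row[0]
    for row in conn.execute(
        "SELECT CASE WHEN my_instr(field, '#') > 0 "
        "THEN substr(field, 1, my_instr(field, '#') - 1) "
        "ELSE field END FROM mytable ORDER BY rowid"
    )
]

print(via_rtrim)
print(via_instr)
```

The CASE guard in the second query leaves rows without a # unchanged. The rtrim version would strip such rows entirely if they contain only the listed characters.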

Postgres PL/pgSQL Function results to file, with filename as argument

I am migrating some client side stuff into the server, and want to put it into a function.
I need to get the results of a query into a CSV file. But, I'd like to pass the file name/location of the resulting file as an argument of the function.
So, this is a simple example of what I want to do:
CREATE FUNCTION send_email_results(filename1 varchar) RETURNS void AS $$
DECLARE
BEGIN
COPY(SELECT * FROM mytable) TO filename1 WITH CSV;
END;
$$ LANGUAGE plpgsql;
Postgres is complaining about this though, as it is translating the filename1 argument to '$1', and it doesn't know what to do.
I can hardcode the path if need be, but being able to pass it as a parameter sure would be handy.
Anyone have any clues?
I just ran into this. It turns out that you can't use parameterized arguments when the copy command is used (at least that's the case with Python as the stored proc language). So, you have to build the command as a string, without arguments, like:
CREATE FUNCTION send_email_results(filename1 varchar) RETURNS void AS $$
DECLARE
BEGIN
execute 'copy (select * from mytable) to ' || filename1 || ' with csv;';
END;
$$ LANGUAGE plpgsql;
The file name has to end up as a quoted string literal inside the command, so you will probably need quote_literal for it. I don't use plpgsql as a postgres function language, so the syntax might be wrong:
execute 'copy (select * from mytable) to ' || quote_literal(filename1) || ' with csv;'

Learning jython string manipulation

I'm learning jython, and I want to see how to replace the suffix of a string.
For example, I have string:
com.foo.ear
and I want to replace the suffix to get:
com.foo.war
I cannot get replace or re.sub to work.
You mention re.sub; here's one way to use that:
import re
re.sub(r'\.ear$', '.war', 'com.foo.ear')
# -> 'com.foo.war'
The $ matches the end of the string, and the backslash makes the dot match a literal '.' instead of any character.
Using replace would be even simpler, as long as 'ear' appears only once in the string (replace changes every occurrence):
'com.foo.ear'.replace('ear','war')
# -> 'com.foo.war'
Edit:
And since that looks like a path, you may want to look into using os.path.splitext:
import os
os.path.splitext('com.foo.ear')[0] + '.war'
# -> 'com.foo.war'
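If only a trailing suffix should ever change, a small helper keeps the intent explicit. This is a minimal sketch: the name replace_suffix is made up, and it sticks to basic string methods that should also be available in Jython.

```python
def replace_suffix(name, old, new):
    """Swap old for new only when name actually ends with old."""
    if name.endswith(old):
        return name[:len(name) - len(old)] + new
    return name

print(replace_suffix('com.foo.ear', '.ear', '.war'))
# Plain replace() rewrites every 'ear', not just the suffix:
print('com.bear.ear'.replace('ear', 'war'))
```

Names that do not end in the old suffix are returned unchanged, which avoids surprises with strings like 'com.bear.jar'.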