Escape non HTML tags from plain text - html-sanitizing

I need to escape non-HTML tags to convert plain text into valid HTML. How can I do that?
I'm using Ruby, but I could use an external tool.

You can use CGI::escapeHTML to escape all special HTML characters so you can safely paste the content into HTML as text.
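Since the question allows an external tool, here is a minimal sketch of the same escaping step in Python, using html.escape from the standard library. The function name text_to_html is made up for illustration and is not part of any library.

```python
import html

def text_to_html(plain: str) -> str:
    """Escape &, <, > and both quote characters so the text displays literally in HTML."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return html.escape(plain, quote=True)

if __name__ == "__main__":
    print(text_to_html('if a < b && c > d then "ok"'))
```

The escaped result can be placed inside an element such as <p> or <pre> without the browser treating any part of it as markup.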

How to tokenize html tags with spacy?

I need to tokenize HTML text with spaCy, or merge the tags after tokenization. They can be any HTML tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in the documentation, but it doesn't work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have also tried writing a custom tokenizer, but I had problems with spaces.
I want every HTML tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.
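The handler approach described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard html.parser instead of libxml2, where handle_starttag plays the role of startElement and handle_data the role of characters. The names SpacedTextExtractor, html_to_text and html_to_words are hypothetical.

```python
from html.parser import HTMLParser

class SpacedTextExtractor(HTMLParser):
    """Collects text content and emits one space for every start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Counterpart of startElement: output a single space.
        self.parts.append(" ")

    def handle_data(self, data):
        # Counterpart of characters: output the text as received.
        self.parts.append(data)

def html_to_text(markup: str) -> str:
    parser = SpacedTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

def html_to_words(markup: str) -> list:
    # split() with no arguments collapses the extra spaces from nested tags.
    return html_to_text(markup).split()

if __name__ == "__main__":
    print(html_to_words("<span>word</span><span>word</span>"))
```

Because only start tags produce a space, the <p><strong>S</strong>oup</p> example still comes out as the single word Soup, while adjacent <span> elements are separated.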

busybox httpd cgi doesn't print "return"

Please help, I can't find a solution.
Situation: I have a busybox httpd server. In the cgi-bin folder there is a CGI executable that sends formatted text to the client using printf.
The problem is that the text should be laid out as a column, but the client receives a single line, even though I use "\n" and "(char) 13" in printf.
In other words, the executable doesn't output the "return" character.
I wrote the following:
for (i=0; i<4;i++)
printf ("%9.8g%c\n", lTemp[i]*dTemp[i], (char) 13 );
The text that is sent from your CGI program to the web client is treated as HTML text, not plain text.
When HTML is processed for display in the browser, newline and carriage return (what you simply call "return") characters are ignored.
To make the displayed text break onto a new line, insert the HTML break tag, "<br />", into the output string:
printf("%9.8g <br />\r\n", lTemp[i] * dTemp[i]);
The use of newlines and whitespace in the text that your CGI program generates will have little bearing on the actual HTML page that gets displayed. Use newlines and whitespace to format the HTML so that the source is readable, and use HTML tags to control the displayed text in the client's browser.
BTW
Using a numeric constant and a character conversion in a printf is not the preferred method of outputting a carriage-return character.
Use the defined escape sequence \r in the format.

How to handle newlines in Handlebars.js

I am using Handlebars.js in my Rails jQuery Mobile application.
I have a JSON value returned as data = "hi\n\n\n\n\nb\n\n\n\nhow r u"
When I use it in the .hbs file as {{data}}, it is displayed as hi how r u, without the actual newlines.
Please advise.
The pre tag helped me.
Handlebars doesn't mess with newlines in your data unless you have registered a helper which is doing something with them. A good way of dealing with newlines in HTML without converting them to br tags would be to use the CSS property white-space while rendering the handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set its value to pre-line.
For example:
<p style="white-space: pre-line">
{{text}}
</p>

Replacing wrongly encoded letters with SQL

I have a database with data from the internet, but some pages have the wrong encoding, and letters like ã become Ã£ and ç becomes Ã§.
What are the possibilities for fixing this? I'm using PostgreSQL.
I can use replace, but do I need to do a replace for each case? I was thinking about translate, but I see that it only maps one character to another. Is it possible to translate two characters into one? Something like: TRANSLATE(text,'Ã£|Ã§','ã|ç').
This particular problem looks like you have UTF-8 encoding being interpreted as a single-byte character set ("ç" becoming "Ã§" suggests iso-8859-1).
You can fix these up individually with a long chain of replace(...) calls. Or you can use PostgreSQL's own character-conversion facilities:
select convert_from(convert_to('Â£20 - garÃ§on', 'iso-8859-1'), 'utf-8')
In order, this:
Converts the string back to binary using the iso-8859-1 codec (which will just change unicode codepoints back to bytes, assuming all the codepoints are under 256)
Reinterprets that binary output as UTF-8, so sequences such as {0xc2, 0xa3} are translated to '£'
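For data that has not reached the database yet, the same two-step round trip can be sketched outside SQL. This is a minimal Python illustration, assuming the text really is UTF-8 that was misread as iso-8859-1. The function name fix_mojibake is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 that was wrongly decoded as iso-8859-1."""
    try:
        # Step 1: turn the code points back into the original bytes.
        raw = text.encode("iso-8859-1")
        # Step 2: reinterpret those bytes as UTF-8.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not damaged in this particular way; leave it unchanged.
        return text

if __name__ == "__main__":
    print(fix_mojibake("Â£20 - garÃ§on"))  # £20 - garçon
```

Returning the input unchanged on failure means strings that are already correct, such as "garçon", pass through untouched.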
You can fix some of the characters by replacing them, but not all. By decoding the data using the wrong encoding you have already removed some information, and that is impossible to get back.
You should find out what the correct encoding is for those pages, and use that when decoding the data.
Some pages have the encoding in the response header, e.g.
Content-Type: text/html; charset=utf-8
Some pages have the encoding in the HTML head, e.g.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
If the information is not in the header you would first have to decode the page (or at least a part of it) using the ASCII encoding (which is not a problem as the meta tag contains no special characters), find out the encoding, then decode the page using the correct encoding.
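The lookup order described above (response header first, then the HTML head read as ASCII) can be sketched roughly as follows. This is an illustration only: the function names, the 4096-byte window and the utf-8 fallback are assumptions, and the regular expression looks for any charset=... text instead of parsing the meta tag properly.

```python
import codecs
import re

# Matches charset=utf-8, charset="utf-8" and similar (a simplification).
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

def detect_charset(content_type_header, body: bytes, default: str = "utf-8") -> str:
    """Return the declared charset, checking the header before the HTML head."""
    if content_type_header:
        match = _CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1)
    # The meta tag itself is plain ASCII, so an ASCII decode of the first
    # bytes is enough to find it; other bytes become replacement characters.
    head = body[:4096].decode("ascii", errors="replace")
    match = _CHARSET_RE.search(head)
    return match.group(1) if match else default

def decode_page(content_type_header, body: bytes) -> str:
    name = detect_charset(content_type_header, body)
    try:
        codecs.lookup(name)
    except LookupError:
        name = "utf-8"  # unknown charset label: fall back to the assumed default
    return body.decode(name, errors="replace")
```

Decoding with the declared charset at this stage avoids storing damaged text in the first place.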
PostgreSQL has a string replacement function:
replace(string text, from text, to text): Replace all occurrences in string of substring from with substring to
Example:
replace ('abcdefabcdef', 'cd', 'XX') ==> abXXefabXXef

Problem with word "Nestlé" in an XML doc (UTF-8 encoding) using NSXMLParser. Any idea?

We are using NSXMLParser in Objective-C to parse our XML documents, which are all UTF-8 encoded. One document has the string "Nestlé" in it (as in ...<title>Nestlé Novelties</title>...). The parser just quit, reporting an error with error code=9, due to the French letter "é" at the end of the word "Nestlé". Furthermore, we tried using IE, Chrome and Safari to show the same document directly. They reported a similar encoding error.
We are using UTF-8 for all incoming XML documents, which means that all of them have "<?xml version="1.0" encoding="UTF-8" ?>" at the top of the document.
Is this an encoding problem? If so, how do we solve this? What encoding should we use for all of our XML documents? Thanks in advance!
Barclay
Have you checked the file with a hex editor to verify that the "é" is indeed UTF-8, 0xC3 0xA9 ?
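If a hex editor is not at hand, the same check can be scripted. Here is a minimal Python sketch that reports the first byte that is not valid UTF-8. The function name find_invalid_utf8 and the sample byte strings are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending byte) for the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.start + 1]
    return None

if __name__ == "__main__":
    good = "Nestl\u00e9 Novelties".encode("utf-8")       # é stored as C3 A9
    bad = "Nestl\u00e9 Novelties".encode("iso-8859-1")   # é stored as a lone E9
    print(find_invalid_utf8(good))
    print(find_invalid_utf8(bad))
```

For a real document, read the file in binary mode and pass its bytes to the function. A lone 0xE9 where 0xC3 0xA9 is expected would mean the file was saved as iso-8859-1 or a similar single-byte encoding, despite the UTF-8 declaration.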
In HTML, I would write the é in Nestlé as a character entity. Does that work for your application?
Something I saw just now in an example XML file: a string containing user-defined input (which happened to include é characters) had the contents of its containing tag wrapped in a CDATA section. This makes the parser treat the characters inside as literal text, not as markup.