HtmlPurifier: Remove Ambiguous WhiteSpace? - formatting

Is it possible to remove ambiguous whitespace from a string containing html with HTMLPurifier? I wasn't able to find anything relevant in the documentation. I'm also open to other methods if they exist.
Example Input:
Test Line1
<p>
Test Line2</p>
<p>
Line 3</p>
Optimal Output:
Test Line1<p>Test Line2</p><p>Line 3</p>
Thanks!!!!!!

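A regular expression can help. Note that replacing every run of whitespace with an empty string would also join the words inside the text (Test Line1 would become TestLine1), so strip only the whitespace around tags:
$text = preg_replace('/\s*(<[^>]+>)\s*/u', '$1', $text);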

How to tokenize html tags with spacy?

I need to tokenize html text with spacy. Or merge tags after tokenization. They can be any html tags, e.g.:
<br> <br/> <br > <n class="ggg">
The documentation has an example of merging the tokens of a tag, but it does not work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have also tried to write a custom tokenizer, but I had problems with spaces.
I want every html tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
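One way to get each tag as a single token is to pre-split the text with a regular expression before any further processing. The sketch below is plain Python and does not use the spaCy API; the name split_tags_and_words is hypothetical, and the pattern assumes that no attribute value contains a literal > character.

```python
import re

# A tag is "<", then anything up to the next ">"; everything else is split on whitespace.
# Assumption: attribute values never contain ">", and stray "<" characters are ignored.
TAG_OR_WORD = re.compile(r"<[^>]+>|[^<\s]+")


def split_tags_and_words(text):
    """Return a list of tokens in which every HTML tag is exactly one token."""
    return TAG_OR_WORD.findall(text)


if __name__ == "__main__":
    for token in split_tags_and_words('one<br><p>two <br > <n class="ggg"> three'):
        print(token)
```

Spaces inside a tag stay inside the tag token, and adjacent tags such as <br><p> come out as two separate tokens.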
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.
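The handler approach described above can be sketched in Python. This swaps in the standard library's html.parser for libxml2 (handle_starttag plays the role of startElement and handle_data the role of characters); the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class SpaceOnTagExtractor(HTMLParser):
    """Collects character data and emits one space for every start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat the start of any element as a word separator.
        self.parts.append(" ")

    def handle_data(self, data):
        # Pass character data through unchanged.
        self.parts.append(data)


def html_to_text(html):
    """Strip the tags, leaving a space where each start tag was."""
    parser = SpaceOnTagExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered character data
    return "".join(parser.parts)


def html_to_words(html):
    """Split the cleaned-up text on whitespace, which absorbs repeated spaces."""
    return html_to_text(html).split()


if __name__ == "__main__":
    print(html_to_words("<span>word</span><span>word</span>"))
```

Because end tags produce no output, <strong>S</strong>oup still comes out as the single word Soup, while the two adjacent span elements are separated.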

How do I eliminate the spaces?

I would like to collect Japanese articles found through Google. To extract the Japanese sentences, I run the following code to get the text of the tag that contains the most Japanese words.
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
    text += s
But when I run this code, the extracted sentences have spaces at their start.
For example, if the HTML is as below and the path is '//p',
<p class dir='sample'>
<span>
<a role='button' tabindex='0' style='white-space: normal;'>A
B</a>
<span> </span>
</span>
</p>
I get the sentences below.
A
B
I tried to eliminate these spaces with text.strip(), but the spaces remained.
How do I get 'AB' from this HTML, or how do I eliminate the spaces? I would appreciate it if anyone could tell me how to get 'AB'.
This can be done with a regular expression:
>>> import re
>>> re.sub(r'\n\s+', '', s)
'AB'
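A slightly more defensive variant, as a sketch: strip() only trims the two ends of a string, so whitespace around line breaks in the middle survives it. The helper below joins the extracted fragments and removes every line break together with the whitespace around it. The name join_text_nodes and the sample fragments are assumptions for illustration, not output from the asker's page.

```python
import re


def join_text_nodes(fragments):
    """Join extracted text fragments and drop line breaks plus surrounding whitespace."""
    text = "".join(fragments)
    # Remove each newline along with any whitespace immediately before or after it.
    text = re.sub(r"\s*\n\s*", "", text)
    # Trim whatever whitespace is left at the two ends.
    return text.strip()


if __name__ == "__main__":
    # Hypothetical fragments, shaped like what .extract() might return.
    sample = ["\n  ", "A\n    B", "\n"]
    print(join_text_nodes(sample))
```

Removing the breaks outright suits Japanese text, which does not separate words with spaces; for space-separated languages, replacing each match with a single space would be the safer choice.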

remove break tags from string using regex

This works great for removing ALL HTML from a string/DB text type field. How can I make it leave break tags alone?
update hazHRA set identityRisk=dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
'',identityRisk,1,1);
I wish to keep only the <br> tags.
This should do the job:
(?i)<(?:(?!br>|br/>)[^>'"]*|(['"]).*?\1)*>
(?i): Case insensitive.
(?!br>|br/>): Negative lookahead, which refuses the match when the tag content starts with br> or br/>.
If your regex engine allows quantifiers in lookaheads, you can use this:
(?i)<(?:(?!br\s*>|br\s*/>)[^>'"]*|(['"]).*?\1)*>
This ensures that <br > with spaces is left alone as well.
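To see what the second pattern does, here is a small sketch using Python's re module rather than the dbo.RegexReplace function from the question. The helper name strip_tags_except_br is hypothetical, and the behaviour of the regex engine behind the SQL function may differ from Python's.

```python
import re

# Same pattern as above: any tag is matched unless its content starts with
# "br", optional whitespace, and then ">" or "/>".
KEEP_BR = re.compile(r"""(?i)<(?:(?!br\s*>|br\s*/>)[^>'"]*|(['"]).*?\1)*>""")


def strip_tags_except_br(html):
    """Remove every HTML tag except <br>, <br/>, and their spaced or upper-case variants."""
    return KEEP_BR.sub("", html)


if __name__ == "__main__":
    print(strip_tags_except_br('Hello<br><b class="x">world</b><BR />!'))
```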

How to handle new line in handlebar.js

I am using Handlebars.js in my Rails jQuery Mobile application.
I have a JSON value returned as data = "hi\n\n\n\n\nb\n\n\n\nhow r u",
which, when used in the .hbs file as {{data}}, is displayed as hi how r u rather than with the actual newlines.
Please advise.
Wrapping the output in a <pre> tag works for me.
Handlebars doesn't mess with newlines in your data unless you have registered a helper which is doing something with them. A good way of dealing with newlines in HTML without converting them to br tags would be to use the CSS property white-space while rendering the handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set its value to pre-line.
For example:
<p style="white-space: pre-line">
{{text}}
</p>

How to include an asterisk in a URL within Django?

I have the following code in my Django template:
<a href="{% url myapp.views.myview foobar %}">
so, what's the right way to handle the situation where "foobar" contains an asterisk (for example, "*1234")? At the moment, Django is throwing this error:
Caught NoReverseMatch while rendering: Reverse for 'myapp.views.myview.myview'
with arguments '(u'*86743',)' and keyword arguments '{}' not found.
Commonly, you'd see this if you're specifying a URL rule with a \w regex. For example, (r'^myview/(?P<myparam>\w.+)/$', 'myview') would cause this, because \w does not match an asterisk. Broadening the regex a bit to allow the asterisk should cure your ills.
Make sure that one of the entries in your urlconf actually matches the arguments you pass. If you only allow digits ([0-9]+) then an asterisk won't match.
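The mismatch can be checked outside Django with the re module alone. In the sketch below, NARROW is the pattern quoted above and BROAD is one hypothetical broadening that accepts any run of characters other than a slash; whether that is broad enough, or too broad, depends on the URLs the application actually needs.

```python
import re

# Pattern from the answer above: the parameter must start with a word character.
NARROW = re.compile(r"^myview/(?P<myparam>\w.+)/$")

# Hypothetical broadened pattern: any non-empty run of characters except "/".
BROAD = re.compile(r"^myview/(?P<myparam>[^/]+)/$")


def matches(pattern, path):
    """Return True if the URL path satisfies the compiled pattern."""
    return pattern.match(path) is not None


if __name__ == "__main__":
    for path in ("myview/a1234/", "myview/*1234/"):
        print(path, matches(NARROW, path), matches(BROAD, path))
```

A value that cannot match the pattern of any urlconf entry is what leads to NoReverseMatch when the url tag tries to build the link.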