How do I eliminate the spaces? - scrapy

I would like to collect the Japanese articles searched by google. I try to extract Japanese sentences, then I run the following code in order to get the tag including the most Japanese words.
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
text += s
but when I run this code, extracted sentence has spaces in their head.
For example, If html is as below and path is '//p',
<p class dir='sample'>
<span>
<a role='button' tabindex='0' style='white-space: normal;'>A
B</a>
<span> </span>
</span>
</p>
I got the sentences as below.
A
B
I tried to eliminate this spaces by method 'text.strip()', but the spaces remained.
How do I get the 'AB' from this html? Or how do I eliminate the spaces? I appreciate it if anyone tell me how to get 'AB'.

This can be done with a regular expression:
>>> import re
>>> re.sub(r'\n\s+', '', s)
'AB'

Related

How to tokenize html tags with spacy?

I need to tokenize html text with spacy. Or merge tags after tokenization. They can be any html tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in documentation for tag, but it can't work with all types of tags. If I write rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have tried to write custom tokenizer also, but I had problem with spaces.
I want every html tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.

multiline code block in markdown adds unwanted tabs

today I'm implementing my page in nanoc (haml templates) and I wanted to write some posts in markdown, but when it goes to multiline code blocks something weird is happening - second line in code block has additional tabs. I've tried multiple markdown syntaxes, such as:
//double tab wrapping
line 1 is fine
line 2 is wrapping (don't know why!)
and
~~~
//tilde code wrapping
line 1 is fine
line 2 is wrapping
~~~
And both solutions gives me the result something like this:
line 1 is fine
line 2 is wrapping
Inspecting elements through browser shows that there is no additional padding - this whitespace is made with tabs for sure.
Can someone help me with this? Maybe I'm doing something wrong?
When you use = in Haml to include the results of a script, Haml will re-indent the inserted text so that it matches the indentation of where it is included. For example, if you have Haml that looks something like this:
%html
%body
.foo
= insert_something
and insert_something returns some HTML like this:
<p>
This is possily generated from Markdown.
</p>
then the resulting HTML will look like this:
<html>
<body>
<div class='foo'>
<p>
This is possily generated from Markdown.
</p>
</div>
</body>
</html>
Note how the p element is indented to match its position in the document.
Normally this doesn’t matter, because of the way whitespace in HTML is collapsed. However there are HTML elements where the whitespace is important, in particular pre.
What it looks like is happening here is that your Markdown is generating something like
<pre><code>line 1 is fine
line 2 is wrapping
</code></pre>
and when it is included in your Haml file (I’m guessing you’re using Haml layouts with = yield to include the Markdown) it is being indented and the whitespace is showing up when you view the page. Note how the first line is immediately after the opening tags so there is no extra whitespace.
There are a couple of ways to fix this. If you set the :ugly option then Haml won’t re-indent blocks like this (sorry I don’t know how you set Haml options in Nanoc).
You could also use the find_and_preserve helper method. This replaces all newlines in whitespace sensitive tags with HTML entity
, so that they won’t be affected by the extra whitespace when indented:
= find_and_preserve(yield)
Haml provides an easy way to use find_and_preserve; ~ works the same as =, except that it runs find_and_preserve on the result, so you can do:
~ yield

remove break tags from string using regex

This works great removing ALL html from a string/DB text type field, how can I omit break tags:
update hazHRA set identityRisk=dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
'',identityRisk,1,1);
I wish to keep the
<br>
only
This should do the job:
(?i)<(?:(?!br>|br/>)[^>'"]*|(['"]).*?\1)*>
(?i): Case insensitive.
(?!br>|br/>): Negative lookahead.
Online demo.
If you could use quantifiers in lookaheads you may use this:
(?i)<(?:(?!br\s*>|br\s*/>)[^>'"]*|(['"]).*?\1)*>
This will ensure to not match <br > with spaces.
Online demo.

How to handle new line in handlebar.js

I am using HandleBar.js in my rails jquery mobile application.
I have a json returned value data= "hi\n\n\n\n\nb\n\n\n\nhow r u"
which when used in .hbs file as {{data}} showing me as hi how r u and not as with the actual new line inserted
Please suggest me.
Pre tag helps me
Handlebars doesn't mess with newlines in your data unless you have registered a helper which is doing something with them. A good way of dealing with newlines in HTML without converting them to br tags would be to use the CSS property white-space while rendering the handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set the value as pre-line
For a example:
<p style="white-space: pre-line">
{{text}}
</p>

HtmlPurifier: Remove Ambiguous WhiteSpace?

Is it possible to remove ambiguous whitespace from a string containing html with HTMLPurifier? I wasn't able to find anything relevant in the documentation. I'm also open to other methods if they exist.
Example Input:
Test Line1
<p>
Test Line2</p>
<p>
Line 3</p>
Optimal Output:
Test Line1<p>Test Line2</p><p>Line 3</p>
Thanks!!!!!!
Can regular expression help?
$text = preg_replace('/\s+/u', '', $text)