Why does soup.get_text() include comments in some situations? - beautifulsoup

I have an HTML document that I've created by exporting a MS Word doc. In Word, I saved as the Web Page (.HTM) format, not the Web Page, Filtered (.HTM) format.
When I run get_text on the doc, it includes the comments in the style tag for some reason. This is unexpected. BS4 does ignore the 2nd comment in the body as expected.
I've tried with both the lxml and html.parser parsers. Same result.
Python 3.9.12, IPython 8.4.0, BS 4.8.2 (though when I use pkg_resources.get_distribution("bs4").version, it shows 0.0.1)
html = """<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Helvetica;
panose-1:2 11 6 4 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
text-align:justify;
text-justify:inter-ideograph;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
/* Page Definitions */
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
ol
{margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple style='word-wrap:break-word'>
<div class=WordSection1>
<div style='border-top:double windowtext 2.25pt;border-left:none;border-bottom:
double windowtext 2.25pt;border-right:none;padding:1.0pt 0in 1.0pt 0in'>
<p class=MsoNormal style='margin-bottom:.25in;border:none;padding:0in'><span
style='font-size:9.0pt'> </span></p>
<!-- Some other random comment that get_text ignores -->
</div>
</body>
</html>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
In [5]: soup.get_text()
Out[5]: '\n\n\n\n\n<!--\n /* Font Definitions */\n @font-face\n\t{font-family:Helvetica;\n\tpanose-1:2 11 6 4 2 2 2 2 2 4;}\n /* Style Definitions */\n p.MsoNormal, li.MsoNormal, div.MsoNormal\n\t{margin:0in;\n\ttext-align:justify;\n\ttext-justify:inter-ideograph;\n\tfont-size:12.0pt;\n\tfont-family:"Times New Roman",serif;}\n /* Page Definitions */\n @page WordSection1\n\t{size:8.5in 11.0in;\n\tmargin:1.0in 1.25in 1.0in 1.25in;}\ndiv.WordSection1\n\t{page:WordSection1;}\n /* List Definitions */\n ol\n\t{margin-bottom:0in;}\n-->\n\n\n\n\n\n\xa0\n\n\n\n\n'

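I can't reproduce the issue: running your code just returns \n\n\n\n\n\n\n\n\n\xa0\n\n\n\n for me (bs4 4.11.1, Python 3.7.15, IPython 7.9.0). I expect the difference is that everything inside the style tag is stored as one Stylesheet element rather than being parsed further into Tag/NavigableString/Comment/etc. That would also fit the version gap: as far as I know, bs4 only started storing embedded CSS as a separate Stylesheet type (which get_text() skips) in 4.9.0, and you are on 4.8.2, where the style content is an ordinary string.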
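Also, the <!-- ... --> wrapper inside the style tag is not an HTML comment as far as the parser is concerned. It is a legacy trick for hiding CSS from very old browsers, and it is simply part of the stylesheet text. Real stylesheet comments look like /*Some Comment*/.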
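To illustrate why the parser never sees that wrapper as a comment, here is a minimal sketch that uses only the standard library's html.parser (not BeautifulSoup itself). The class name CommentProbe and the shortened sample markup are made up for the example. It records what arrives as a comment and what arrives as raw text inside the style tag.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Hypothetical helper: records comments and the raw text found inside <style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.comments = []
        self.style_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_comment(self, data):
        # Only called for markup the parser recognises as a comment
        self.comments.append(data)

    def handle_data(self, data):
        # The content of <style> is delivered here as plain text
        if self.in_style:
            self.style_chunks.append(data)


sample = (
    "<html><head><style><!-- p.MsoNormal {margin:0in;} --></style></head>"
    "<body><p>text</p><!-- body comment --></body></html>"
)

probe = CommentProbe()
probe.feed(sample)
probe.close()

style_text = "".join(probe.style_chunks)
print(repr(style_text))   # the "<!--" marker shows up here, as text
print(probe.comments)     # only the comment from the body
```

With BeautifulSoup, a common way around the problem on older versions is to remove the style tags before calling get_text().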

Related

Ignore / skip tags when prettifying with BeautifulSoup

Is it possible to ignore / skip certain tags when parsing and prettifying an HTML-document with BeautifulSoup?
I am using BeautifulSoup to prettify HTML-documents with large embedded SVG-images. There is no need to prettify the SVG-images and all of their child-elements. As performance is critical for this application, I thought I might be able to save some runtime by ignoring / skipping the SVG-elements when prettifying the HTML, and just include the SVG-elements as they originally were in the input.
I am aware of SoupStrainer but it seems to do the exact opposite of what I need. I have also read many of the posts here on StackOverflow and elsewhere, and none of them seem to address this issue.
Example
# Messy HTML code.
messy = \
"""
<html> <head>
<title>
Some title</title>
</head> <body>
<svg>Don't parse and prettify this!</svg>
</body> </html>
"""
# Prettify the HTML code.
from bs4 import BeautifulSoup
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
Which produces the result:
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>
   Don't parse and prettify this!
  </svg>
 </body>
</html>
Note that the <svg> element has also been parsed and prettified by BeautifulSoup. Is there a way to avoid this?
Thanks!
As far as I can tell, bs4 doesn't allow for skipping particular tags; but you could write your own parser (like here) and include or allow exceptions, or use regex to replace the tags you don't want to parse.
First, list the tags you want to skip parsing
skipTags = ['svg']
# skipTags = ['svg', 'script', 'style'] ## list all the tag names to skip
If you don't care about preserving the tags, you could just get rid of them entirely.
import re
from bs4 import BeautifulSoup
# DOTALL plus a non-greedy match lets each element span several lines
for n in skipTags: messy = re.sub(rf'<{n}\b[^>]*>.*?</{n}>', '', messy, flags=re.DOTALL)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
If you want to preserve the tags, then replace them with comments and swap the comments back after prettifying. [This can be significantly slower than just getting rid of them.]
import re
from bs4 import BeautifulSoup

cReps = []
for n in skipTags:
    # DOTALL plus a non-greedy match lets each element span several lines
    rcpat = re.compile(rf'<{n}\b[^>]*>.*?</{n}>', flags=re.DOTALL)
    cReps += [m.span() for m in rcpat.finditer(messy)]
cReps.sort()  # keep spans in document order when several tag names are listed
# substitute from the end so that earlier positions stay valid
for cri, (sPos, ePos) in list(enumerate(cReps))[::-1]:
    repCmt, orig = f'<!--do_not_parse__placeholder_{cri}-->', messy[sPos:ePos]
    messy = messy[:sPos] + repCmt + messy[ePos:]
    cReps[cri] = (repCmt, orig)

pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
for repCmt, orig in cReps:
    pretty = pretty.replace(repCmt, orig, 1)
print('<!--messy-subbed-->', messy, '\n<!--pretty-->', pretty, sep='\n')
Printed output of the last statement above, with the sample HTML in your question looks like:
<!--messy-subbed-->
<html> <head>
<title>
Some title</title>
</head> <body>
<!--do_not_parse__placeholder_0-->
</body> </html>
<!--pretty-->
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>Don't parse and prettify this!</svg>
 </body>
</html>
Note that I don't know if either method will actually improve performance, especially when you consider the extra passes made over the HTML string(s). You might want to look into https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
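If the placeholder approach is used in more than one place, it can be wrapped into two small helpers that do not depend on the prettifier at all. This is only a sketch: protect_tags and restore_tags are hypothetical names, it assumes the skipped elements are not nested inside each other, and it relies on regular expressions rather than a real parser, so unusual markup may slip through.

```python
import re


def protect_tags(markup, tag_names):
    """Swap each matching element for a comment placeholder; return (markup, saved)."""
    spans = []
    for name in tag_names:
        pattern = re.compile(
            rf"<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}>",
            re.DOTALL | re.IGNORECASE,
        )
        spans += [m.span() for m in pattern.finditer(markup)]
    spans.sort()  # document order, whatever the order of tag_names

    saved = []
    # Walk backwards so that earlier offsets remain valid after each substitution
    for index in range(len(spans) - 1, -1, -1):
        start, end = spans[index]
        placeholder = f"<!--do_not_parse__placeholder_{index}-->"
        saved.append((placeholder, markup[start:end]))
        markup = markup[:start] + placeholder + markup[end:]
    return markup, saved


def restore_tags(markup, saved):
    """Put the original elements back in place of their placeholders."""
    for placeholder, original in saved:
        markup = markup.replace(placeholder, original, 1)
    return markup


messy = "<body> <svg><g>keep   me</g></svg> <script>var a = 1;</script> </body>"
protected, saved = protect_tags(messy, ["script", "svg"])
# A prettifier would run on `protected` here; str.strip() is only a stand-in
restored = restore_tags(protected.strip(), saved)
print(protected)
print(restored)
```

Whether this saves any time still has to be measured on real documents, for example with timeit.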

LinkedIn webscraping selenium issue

Element <div class="block mt2"> is not showing up when searching in output of print(soup).
# Scrape the data of 1 LinkedIn profile, write the data to a csv file
wd.get("https://www.linkedin.com/company/pacific-retail-capital-partners/")
soup = BeautifulSoup(wd.page_source, "html.parser")
soup
Output exceeds the size limit. Open the full output data in a text editor
<html class="theme theme--mercado artdeco windows" lang="en"><head>
<script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);</script>
<title>LinkedIn</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta class="mercado-icons-sprite" content="https://static-exp2.licdn.com/sc/h/7438dbnn8galtczp2gk2s4bgb" id="artdeco-icons/static/images/sprite-asset" name="asset-url"/>
<meta content="" name="description"/>
<meta content="notranslate" name="google"/>
<meta content="voyager-web" name="service"/>
HTML inspect
<div class="block mt2">
<div>
<h1 id="ember30" class="ember-view t-24 t-black t-bold
full-width" title="Pacific Retail Capital Partners">
<span dir="ltr">Pacific Retail Capital Partners</span>
</h1>
Since the HTML document is loaded in our script, we should be able to scrape the name of the company using the div tag.
info_div = soup.find('div', {'class' : 'block mt2'})
print(info_div)
The output is None; no information is printed.
Can you explain what's happening and what needs to be fixed?
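When find() is given a multi-class string such as 'block mt2', it only matches a class attribute that is exactly that string, so it fails as soon as the order differs or another class is present.
Use the CSS selector method select_one() to get the company details instead.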
info_div = soup.select_one('div.block.mt2')
print(info_div.text)
Or, to get only the company name, use this.
company = soup.select_one('div.block.mt2 h1>span')
print(company.text)
If you still want to use the find() method, then try this.
info_div = soup.find('div', {'class' : 'mt2'})
print(info_div.text)
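The difference between these lookups comes down to how the class attribute is compared. As an illustration only (plain Python, not BeautifulSoup's actual implementation; both helper names and the extra class pv3 are hypothetical), exact-string matching is sensitive to order and to extra classes, while token matching, which is what a selector like div.block.mt2 expresses, is not:

```python
def matches_exact_string(class_attr, wanted):
    """True only if the attribute value is literally the wanted string."""
    return class_attr == wanted


def matches_all_tokens(class_attr, wanted_tokens):
    """True if every wanted class appears among the whitespace-separated tokens."""
    return set(wanted_tokens) <= set(class_attr.split())


for value in ("block mt2", "mt2 block", "block mt2 pv3"):
    print(
        repr(value),
        matches_exact_string(value, "block mt2"),
        matches_all_tokens(value, ["block", "mt2"]),
    )
```

If neither lookup finds the element, it is worth checking whether the div is present in wd.page_source at all, since the question says it does not show up in the printed soup.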

How to control page breaks with react-native-html-to-pdf?

I am generating a pdf document using react-native-html-to-pdf.
When the document contains a long list of elements it is possible for some elements to span two pages within the same document.
For example this simple html:
<html>
<head></head>
<body>
<section style="border:solid 1px black;"><p>item</p></section>
<!-- sections repeat 32 times omitted for brevity -->
</body>
</html>
I get a document that looks like this at the page break:
How can I control this? Does it depend on the html elements in the document?

Rendering issue for combined font (Japanese & English ) in PDF using cfdocument

I am having trouble with combined fonts (Japanese and English).
I have to create a PDF document from HTML content that is shown on my website. For that I used <cfdocument> to generate the PDF from the HTML content. My content includes both Japanese and English text, and it appears in a different font in the generated PDF than on my website. The issue occurs only in sections that combine Japanese and English.
The requirement is:
For English content, the font should be Verdana.
For Japanese content, the font should be Simson.
I have implemented the same with Korean, Chinese and French, and it works.
To output the special characters, I added <cfprocessingDirective pageEncoding="utf-8"> above the code, but I still get a strange font for content that mixes English and Japanese.
The code I have tried is given below:
<cfcontent type="application/pdf">
<cfheader name="Content-Disposition" value="attachment;filename=test.pdf">
<cfprocessingdirective pageencoding="utf-8">
<cfdocument format="PDF" localurl="yes" marginTop=".25" marginLeft=".25" marginRight=".25" marginBottom=".25" pageType="custom" pageWidth="8.5" pageHeight="10.2">
<cfoutput>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>PDF Export Example</title>
<style>
body { font-family: Verdana; }
h1 { font-size: 14px; }
p { font-size: 12px; line-height: 1.25em; margin-left:20px;}
</style>
</head>
<body>
<h1>PDF Export Example Combined Japanese & English</h1>
<p>This is an japanese with english example
日本人は単純な音素配列論で膠着、モーラ·タイミングの言語、純粋な母音システム、
音素の母音と子音の長さ、および語彙的に重要なピッチアクセント。語順は通常、粒子が言葉の文法的機能をマ
ーキング対象オブジェクトと動詞であり、文の構造は、トピック·コメントです。文末粒子は、感情的または強調の影響を追加したり、
質問を作るために使用されます。名詞は文法的に番号や性別を持たず、何の記事はありません。動詞は主に緊張し、音声ではなく、
人のために、コンジュゲートされる。形容詞の日本の同等物は、また、結合している。日本人は動詞の形や語彙、話者の相対的な地位、
リスナーおよび掲げる者を示すと敬語の複雑なシステムを持っています。This is an example.
</p>
<h1>PDF Export English Example</h1>
<p>This is an example.
</p>
</body>
</html>
</cfoutput>
</cfdocument>
What else should I do to fix this problem?
Thank you.
From my results and research, I found that there is an issue with PDF style rendering when English is mixed with Japanese content.
I finally found a workaround:
Insert a space between the Japanese and English words.
Iterate over the whole string as a space-delimited list.
Distinguish the English words from the Japanese words with a regular expression.
Apply a separate style to the English words by wrapping them in a span (or any other) tag.
I don't know whether this is the proper solution for this issue, but it is what I did to solve it.
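The steps above can be sketched in a few lines. The sketch below is in Python rather than CFML, purely as an illustration of the idea. It assumes that "English" means a token made only of printable ASCII characters and that the input is plain text, not markup; the function names and the CSS class en are hypothetical.

```python
import re

# Assumption: an "English" token consists only of printable ASCII characters
ASCII_TOKEN = re.compile(r"[\x21-\x7E]+")


def separate_scripts(text):
    """Insert a space wherever ASCII text touches non-ASCII text."""
    text = re.sub(r"([\x21-\x7E])([^\x00-\x7F])", r"\1 \2", text)
    return re.sub(r"([^\x00-\x7F])([\x21-\x7E])", r"\1 \2", text)


def wrap_english(text, css_class="en"):
    """Wrap each ASCII-only token in a span so it can be styled separately."""
    # The capturing group keeps the whitespace runs, so spacing is preserved
    tokens = re.split(r"(\s+)", separate_scripts(text))
    wrapped = [
        f'<span class="{css_class}">{token}</span>' if ASCII_TOKEN.fullmatch(token) else token
        for token in tokens
    ]
    return "".join(wrapped)


result = wrap_english("日本語です。This is an example.")
print(result)
```

The stylesheet would then set the English font on span.en and the Japanese font on the surrounding element.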

How to do standard layouts with StringTemplate?

With StringTemplate, what is the proper way to have a standard layout template such as:
<head>
..
</head>
<html>
$body()$
</html>
Where I can set the body template from my application, so that every template I use uses this fundamental layout?
Thanks.
I found it hiding in the documentation:
http://www.antlr.org/wiki/display/ST/StringTemplate+2.2+Documentation
"Include template whose name is computed via expr. The argument-list is a list of attribute assignments where each assignment is of the form attribute=expr. Example $(whichFormat)()$ looks up whichFormat's value and uses that as template name. Can also apply an indirect template to an attribute."
So my main layout template now looks like this:
<head>
<title>Sportello</title>
</head>
<html lang="en-US">
<body>
$partials/header()$
<section>$(body_template)()$</section>
$partials/footer()$
</body>
</html>
...to which I pass the subtemplate's name as an attribute.