I am using the BeautifulSoup Python module to parse the HTML content of a Wiki.js-based webpage. However, I am having trouble extracting the text component of the header and paragraph tags.
I have tried the .getText() method and the .text property, but wasn't able to extract the text from the header/paragraph tags.
Below is the code snippet for reference:
import requests
from bs4 import BeautifulSoup
# a random webpage built using wiki.js
url = "https://brgswiki.org/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
heading_tags = ["h1","h2"]
for tags in soup.find_all(heading_tags):
    print("=============================================")
    print(f"complete Header Tag with the text:\n{tags}")
    print("=============================================")
    print("just header tag_name and header text_content")
    print(tags.name + ' -> ' + tags.text.strip())
And here's the output:
=============================================
complete Header Tag with the text:
<h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
=============================================
just header tag_name and header text_content
h2 ->
As you can see in the output, the h2 tag text "Subscribe to our new newsletter!" is not being extracted.
I see this issue only with webpages built on Wiki.js; other webpages work just fine.
Any suggestion/guidance on how to get around this issue is appreciated.
Thank you.
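A minimal sketch of one workaround worth trying, assuming the page structure shown in the output above: join each header's stripped_strings instead of relying on .text, so text nested in children such as <em> or <a> is collected explicitly.
import requests
from bs4 import BeautifulSoup

url = "https://brgswiki.org/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.find_all(["h1", "h2"]):
    # stripped_strings yields every descendant text node with surrounding
    # whitespace removed, including text inside <a> and <em> children
    print(tag.name, "->", " ".join(tag.stripped_strings))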
Related
Does JSSoup support extracting text, similar to Beautiful Soup's soup.findAll(text=True)?
The documentation does not provide any information about this use case, but it seems to me that there should be a way.
To clarify: what I want is to grab all visible text from the page.
In Beautiful Soup you can extract text in different ways, with find_all(text=True) but also with .get_text() or .text.
JSSoup works similarly to Beautiful Soup: to extract all visible text, just call .get_text(), .text, or string on your soup.
Example (jssoup)
var soup = new JSSoup('<html><head><body>text<p>ptext</p></body></head></html>');
soup.get_text('|')
// 'text|ptext'
soup.get_text('|').split('|')
// ['text','ptext']
Example (beautiful soup)
from bs4 import BeautifulSoup
html = '''<html><head><body>text<p>ptext</p></body></head></html>'''
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text('|').split('|'))
Output
['text','ptext']
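For comparison, the find_all(text=True) route mentioned in the question returns the same visible strings as a list; a quick sketch against the same snippet:
from bs4 import BeautifulSoup

html = '''<html><head><body>text<p>ptext</p></body></head></html>'''
soup = BeautifulSoup(html, "html.parser")

# find_all(text=True) collects every text node (NavigableString) in the tree
print(soup.find_all(text=True))
# ['text', 'ptext']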
I tried to find the 'td' tag with a specific attribute, and then find the 'a' tag inside of the 'td' tag:
for row in bs4.find_all('<td class="series-column"'):
    for link in bs4.find_all('a'):
        if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
            print(link.attrs['href'])
The screenshot shows the HTML for this page.
Your bs4.find_all('<td class="series-column"') is wrong. You have to supply the tag name and the attributes you want to find, for example bs4.find_all('td', class_='series-column'). Or use a CSS selector:
from bs4 import BeautifulSoup
txt = '''
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
    print(link['href'])
Prints:
//knoema.com/...
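The find_all form mentioned above works the same way; a quick sketch against the same snippet:
from bs4 import BeautifulSoup

txt = '''
<td class="series-column">
    <a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')

# class_ (with the trailing underscore) filters on the class attribute,
# since "class" is a reserved word in Python
for td in soup.find_all('td', class_='series-column'):
    for link in td.find_all('a', href=True):
        print(link['href'])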
Here is the html string in question.
<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>
With BeautifulSoup, this code
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
gets me
a book of grammar rules:
which is exactly what I want.
With scrapy, how do I get the same result?
from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()
this code gets me
['a ', ' of grammar ', ': ']
How should I fix it?
You can use this code to get all text inside the div and its children:
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)
Your selector returns text only from the div itself, but part of the text is located inside its child elements (the a tags); that's why you have to add a space before ::text to include the children's text in the result.
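A self-contained sketch of the difference, using a shortened version of the question's HTML (the hrefs are trimmed here for brevity):
from scrapy import Selector

htmltxt = '<div class="def ddef_d db">a <a class="query">book</a> of grammar <a class="query">rules</a>: </div>'
sel = Selector(text=htmltxt)

# '.ddef_d::text' matches only text nodes that sit directly inside the div
print(sel.css('.ddef_d::text').getall())            # ['a ', ' of grammar ', ': ']

# '.ddef_d ::text' (note the space) also matches text inside descendants
print(''.join(sel.css('.ddef_d ::text').getall()))  # a book of grammar rules: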
I'm trying to scrape content from a listing detail page that can only be viewed by clicking the 'view' button, which triggers a form submit. I am new to both Python and Scrapy.
Example markup
<li><h3>Abc Widgets</h3>
<form action="/viewlisting?id=123" method="post">
<input type="image" src="/images/view.png" value="submit" >
</form>
</li>
My solution in Scrapy is to extract the form actions, then use Request to return the page, with a callback to parse it for the desired content. However, I have hit a few issues.
First, I'm getting the following error: "request url must be str or unicode".
Secondly, when I hardcode a URL to overcome the above issue, it seems my parsing function is returning what looks like a list.
Here is my code, with the real URLs redacted:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from wfi2.items import Wfi2Item
class ProfileSpider(Spider):
    name = "profiles"
    allowed_domains = ["wfi.com.au"]
    start_urls = [
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT"
    ]

    def parse(self, response):
        hxs = Selector(response)
        forms = hxs.xpath('//*[@id="area-managers"]//*/form')
        for form in forms:
            action = form.xpath('@action').extract()
            print "ACTION: ", action
            request = Request(url=action, callback=self.parse_profile)
            yield request

    def parse_profile(self, response):
        hxs = Selector(response)
        profile = hxs.xpath('//*[@class="contentContainer"]/*/text()')
        print "PROFILE", profile
I'm getting the following error "request url must be str or unicode"
Please have a look at the Scrapy documentation for extract(). It says: "Serialize and return the matched nodes as a list of unicode strings" (emphasis on "list" added by me).
The first element of the list is probably what you want. So you could do something like:
request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
secondly when I hardcode a URL to overcome the above issue it seems my parsing function is returning what looks like a list
According to the documentation of xpath(), it's a SelectorList. Add extract() to the xpath call and you'll get a list of the text tokens. Eventually you'll want to clean up and join the elements of that list before further processing.
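A minimal sketch of how parse_profile could apply both hints (the XPath comes from the question; the cleanup step is one plausible approach):
def parse_profile(self, response):
    hxs = Selector(response)
    # extract() turns the SelectorList into a list of unicode strings
    tokens = hxs.xpath('//*[@class="contentContainer"]/*/text()').extract()
    # strip whitespace and join the fragments into a single readable string
    profile = ' '.join(t.strip() for t in tokens if t.strip())
    print "PROFILE", profile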
I have:
... html
<div id="price">$199.00</div>
... html
How do I get the $199.00 text? Using
soup.findAll("div",id="price",text=True)
does not work, as I get all the inner text from the whole document.
Find the div tag, and use the text attribute to get the text inside the tag.
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <html>
... <body>
... <div id="price">$199.00</div>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html)
>>> soup.find('div', id='price').text
u'$199.00'
You are SO close to making it work.
(1) How to search and locate the tag that you are interested:
Let's take a look at how to use find_all function:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):...
name="div": the name argument contains the tag name
attrs={"id":"price"}: attrs is a dictionary that contains the attribute filters
recursive: a flag for whether to dive into the tag's children or not
text: can be used along with regular expressions to search for tags which contain certain text
limit: how many results to return; limit=1 makes find_all behave the same as find
In your case, here is a list of commands that locate the tags by playing with different flags:
>> # in case you have multiple interesting DIVs I am using find_all here
>> html = '''<html><body><div id="price">$199.00</div><div id="price">$205.00</div></body></html>'''
>> soup = BeautifulSoup(html)
>> print soup.find_all(attrs={"id":"price"})
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
>> # This is a bit funky but sometime using text is extremely helpful
>> # because text is actually what human will see so it is more reliable
>> import re
>> tags = [text.parent for text in soup.find_all(text=re.compile(r'\$'))]
>> print tags
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
There are many different ways to locate your elements, and you just need to ask yourself: what will be the most reliable way to locate an element?
More information about BS4's find_all is available in the Beautiful Soup documentation.
(2) How to get the text of a tag:
tag.text will return unicode, and you can convert it to string type by using tag.text.encode('utf-8').
tag.string will also work.
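A quick illustration of both on the price div (same snippet as above; note that .string only works when the tag has a single string child, otherwise it returns None):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div id="price">$199.00</div>')
>>> tag = soup.find('div', id='price')
>>> tag.text
u'$199.00'
>>> tag.string
u'$199.00'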