Does JSSoup support extracting text? - beautifulsoup

Does JSSoup support extracting text the way Beautiful Soup's soup.findAll(text=True) does?
The documentation does not cover this use case, but it seems to me that there should be a way.
To clarify, what I want is to grab all visible text from the page.

In Beautiful Soup you can extract text in several ways: with find_all(text=True), but also with .get_text() or .text.
JSSoup works similarly to Beautiful Soup. To extract all visible text, just call .get_text(), .text, or .string on your soup.
Example (jssoup)
var soup = new JSSoup('<html><head><body>text<p>ptext</p></body></head></html>');
soup.get_text('|')
// 'text|ptext'
soup.get_text('|').split('|')
// ['text','ptext']
Example (beautiful soup)
from bs4 import BeautifulSoup
html = '''<html><head><body>text<p>ptext</p></body></head></html>'''
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text('|').split('|'))
Output
['text', 'ptext']
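For the Beautiful Soup side, a minimal sketch of collecting only the visible text nodes might look like the following. It assumes that "visible" means non-empty text outside script and style tags; the sample markup and the HIDDEN_PARENTS name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the question): a script tag is added
# so there is something to filter out.
html = "<html><body>text<p>ptext</p><script>var x = 1;</script></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Assumption: "visible" means non-empty text whose parent is not
# a <script> or <style> tag.
HIDDEN_PARENTS = {"script", "style"}

visible = [
    s.strip()
    for s in soup.find_all(string=True)  # every text node in the tree
    if s.parent.name not in HIDDEN_PARENTS and s.strip()
]
print(visible)
```

Filtering on the parent tag keeps the result the same whether or not the installed Beautiful Soup version already skips script contents in get_text().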

Related

issue with parsing wiki.js webpage's HTML content using beautifulsoup

I am using the Beautiful Soup Python module to parse the HTML content of a wiki.js-based webpage. However, I am having trouble extracting the text of the header and paragraph tags.
I have tried the .getText() method and the .text property, but wasn't able to extract the text from the header/paragraph tags.
Below is the code snippet for reference:
import requests
from bs4 import BeautifulSoup
# a random webpage built using wiki.js
url = "https://brgswiki.org/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
heading_tags = ["h1","h2"]
for tags in soup.find_all(heading_tags):
    print("=============================================")
    print(f"complete Header Tag with the text:\n{tags}")
    print("=============================================")
    print("just header tag_name and header text_content")
    print(tags.name + ' -> ' + tags.text.strip())
And here's the output:
=============================================
complete Header Tag with the text:
<h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
=============================================
just header tag_name and header text_content
h2 ->
As you can see in the output, the h2 tag's text, "Subscribe to our new newsletter!", is not being extracted.
I see this issue only with webpages built on wiki.js; other webpages work just fine.
Any suggestions or guidance on how to get around this issue would be appreciated.
Thank you.

Iterating through a findall div with beautifulsoup

Using Beautiful Soup I would like to iterate through each of the div data-search-sol-meta={blah:blah...} and print all of the contents inside of the div.
import requests
from bs4 import BeautifulSoup

# `header` (the request headers) is not shown in the question
page = requests.get('https://www.seek.com.au/python-junior-jobs', headers=header)
soup = BeautifulSoup(page.content, 'html.parser')
section = soup.find('div', {'class':'_3MPUOLE'})
for div in section.findAll('div.data-search-sol-meta'): #<-- having difficulty with this
    print(div)
    print("\n")
Question:
How can I go through the website and iterate through all of the div.data-search-sol-meta so that I can print and further process the contents of the div?
Try changing your for loop to
for div in section.select('div[data-search-sol-meta]'):
and see if it works.
I took a look at the page you are trying to parse, and I'd suggest using results = soup.find_all('article')
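To illustrate the attribute-selector suggestion without depending on the live site, here is a small sketch on made-up markup. The HTML, the "results" class, and the JSON-like attribute values are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="results">
  <div data-search-sol-meta='{"jobId": "1"}'>Junior Python Developer</div>
  <div data-search-sol-meta='{"jobId": "2"}'>Graduate Engineer</div>
  <div class="ad">Sponsored</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", class_="results")

# CSS attribute selector: every div that has the attribute at all.
by_select = section.select("div[data-search-sol-meta]")

# The same filter with find_all: True means "attribute is present".
by_find_all = section.find_all("div", attrs={"data-search-sol-meta": True})

for div in by_select:
    print(div["data-search-sol-meta"], div.get_text(strip=True))
```

The original loop found nothing because find_all('div.data-search-sol-meta') looks for a tag literally named "div.data-search-sol-meta"; CSS selector syntax is only understood by select.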

How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href?

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" on "i", I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
The issue seems to be for i in soup.find. If you're looking for only one element, there's no need to iterate that element, and if you're looking for multiple elements, find_all instead of find would probably match the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that i is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an xpath).
import requests
from bs4 import BeautifulSoup
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])
print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])
Output:
index.php?page=2
index.php?page=2
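To see where the NavigableString comes from, here is a minimal sketch on static markup. The HTML is an assumed approximation of the page's navigation div, not a copy of it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from bs4.element import Tag

base_url = "http://python.beispiel.programmierenlernen.io/index.php"

# Assumed shape of the navigation block (illustrative only).
html = """
<div class="navigation">
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
nav = soup.find("div", {"class": "navigation"})

# Iterating over a Tag walks its direct children, and the newlines
# around the <a> are children too (as NavigableString objects).
child_types = [type(child).__name__ for child in nav]
print(child_types)

# Keep only real tags that carry an href before reading the attribute.
hrefs = [
    child["href"]
    for child in nav
    if isinstance(child, Tag) and child.has_attr("href")
]

# Resolve the relative links against the page URL.
absolute = [urljoin(base_url, href) for href in hrefs]
print(absolute)
```

The text nodes have no attrs, which is why i.attrs["href"] fails on them; checking isinstance(child, Tag) is one way to skip them if you do want to loop over children.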

Getting a specific part of a website with Beautiful Soup 4

I've got the basics down of finding stuff with Beautiful Soup 4. However, right now I am stuck on a specific problem. I want to scrape the "2DKT94P" from the data-oid attribute in the code below:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer for an advanced tutorial that covers this, and/or a link on where I would have been able to find this in the official documentation because I failed to recognize the correct part...
Thanks in advance!
You should locate the div tag using its class attribute, then get its data-oid attribute:
div = soup.find("div", class_="js-object")
oid = div['data-oid']
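As a self-contained sketch of that approach, using the snippet from the question as static input (the data-unknown attribute below is a made-up name, used only to show the missing-attribute case):

```python
from bs4 import BeautifulSoup

html = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
  <div class="listitem relative js-listitem "></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if "js-object" is any one of the tag's classes.
div = soup.find("div", class_="js-object")

# Subscript access raises KeyError when the attribute is missing ...
oid = div["data-oid"]

# ... while .get() returns None instead.
estate_id = div.get("data-estateid")
missing = div.get("data-unknown")  # hypothetical attribute name

print(oid, estate_id, missing)
```

On a real page it is worth checking that find did not return None before reading attributes from the result.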
If your data is well formatted, you can do it this way:
from bs4 import BeautifulSoup
example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs={"data-oid": "2DKT94P"})
print(RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
For more on find and find_all with attributes, see the Beautiful Soup documentation.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print (RandomDIV[0].get_text().strip())
For more on select, see the Beautiful Soup documentation on CSS selectors.
EDIT:
I totally misunderstood the question. If you want to search for any tag that has a data-oid attribute, you can do it like this:
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find_all(lambda tag: tag.has_attr('data-oid'))
for div in RandomDIV:
    # value of the data-oid attribute
    print(div["data-oid"])
    # text inside the tag
    print(div.text.strip())
Learn more here.

Getting inner tag text whilst using a filter in BeautifulSoup

I have:
... html
<div id="price">$199.00</div>
... html
How do I get the $199.00 text? Using
soup.findAll("div",id="price",text=True)
does not work, as I get all the inner text from the whole document.
Find the div tag, and use its text attribute to get the text inside the tag.
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <html>
... <body>
... <div id="price">$199.00</div>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('div', id='price').text
'$199.00'
You are SO close to making it work.
(1) How to search for and locate the tag you are interested in:
Let's take a look at how to use the find_all function:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):...
name="div": the tag name to match.
attrs={"id":"price"}: a dictionary of attributes the tag must have.
recursive: whether to search all descendants (True) or only direct children (False).
text: matches on text content, and can be combined with regular expressions to find tags that contain certain text (since Beautiful Soup 4.4.0 this argument is called string; text is the older name).
limit: the maximum number of results to return. find behaves like find_all with limit=1, except that it returns the element itself rather than a one-item list.
In your case, here are some commands that locate the tags using different arguments:
>>> # in case you have multiple DIVs of interest, I am using find_all here
>>> html = '''<html><body><div id="price">$199.00</div><div id="price">$205.00</div></body></html>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.find_all(attrs={"id": "price"}))
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
>>> # This is a bit funky, but sometimes matching on text is extremely helpful
>>> # because text is what a human actually sees, so it can be more reliable
>>> import re
>>> tags = [text.parent for text in soup.find_all(text=re.compile(r'\$'))]
>>> print(tags)
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
There are many different ways to locate your elements; you just need to ask yourself which is the most reliable way to locate an element.
For more information about find in Beautiful Soup 4, see the official documentation.
(2) How to get the text of a tag:
tag.text returns a Unicode string, and tag.text.encode('utf-8') converts it to a UTF-8 byte string (str in Python 2, bytes in Python 3).
tag.string also works when the tag contains a single string.
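The difference between .text and .string can be sketched as follows, assuming Python 3. The second div (id="note") is made up for illustration to show a tag with more than one child.

```python
from bs4 import BeautifulSoup

# The "note" div is an illustrative addition with mixed children.
html = '<div id="price">$199.00</div><div id="note">from <b>$205.00</b></div>'
soup = BeautifulSoup(html, "html.parser")

price = soup.find("div", id="price")
note = soup.find("div", id="note")

# A tag with a single text child: .text and .string agree.
print(price.text)    # $199.00
print(price.string)  # $199.00

# A tag with several children: .text joins all nested text,
# while .string is None because there is no single string to return.
print(note.text)     # from $205.00
print(note.string)   # None

# Encoding turns the Unicode text into bytes in Python 3.
encoded = price.text.encode("utf-8")
print(encoded)       # b'$199.00'
```

For tags that may contain nested markup, .text or get_text() is usually the safer choice, since .string silently returns None in that case.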