Iterating through a findall div with BeautifulSoup

Using Beautiful Soup, I would like to iterate through each of the divs that carry data-search-sol-meta={blah:blah...} and print all of the contents inside each div.
page = requests.get('https://www.seek.com.au/python-junior-jobs', headers=header)
soup = BeautifulSoup(page.content, 'html.parser')
section = soup.find('div', {'class': '_3MPUOLE'})
for div in section.findAll('div.data-search-sol-meta'):  # <-- having difficulty with this
    print(div)
    print("\n")
Question:
How can I go through the website and iterate through all of the divs that have a data-search-sol-meta attribute, so that I can print and further process the contents of each div?

Try changing your for loop to
for div in section.select('div[data-search-sol-meta]'):
and see if it works.

I took a look at the page you are trying to parse, and I'd suggest using results = soup.find_all('article') instead.
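Putting those two suggestions together, here is a minimal sketch; the _3MPUOLE class comes from the question, the header dict is a placeholder (the question doesn't show its contents), and seek.com.au's markup may well have changed since:
import requests
from bs4 import BeautifulSoup

# Placeholder headers; the question's actual header dict isn't shown
header = {'User-Agent': 'Mozilla/5.0'}

page = requests.get('https://www.seek.com.au/python-junior-jobs', headers=header)
soup = BeautifulSoup(page.content, 'html.parser')

# A CSS attribute selector matches divs that *have* the attribute,
# whatever its value; 'div.data-search-sol-meta' would look for a class
for div in soup.select('div[data-search-sol-meta]'):
    print(div.get_text(strip=True))

# Alternatively, each job card on the page is wrapped in an <article>
for article in soup.find_all('article'):
    print(article.get_text(strip=True))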

Related

Does JSSoup support extracting text?

Does JSSoup support extracting text similar to Beautiful Soup's soup.findAll(text=True)?
The documentation does not provide any information about this use case, but it seems to me that there should be a way.
To clarify: what I want is to grab all visible text from the page.
In Beautiful Soup you can extract text in several ways: with find_all(text=True), but also with .get_text() or .text.
JSSoup works similarly to Beautiful Soup. To extract all visible text, just call .get_text(), .text or .string on your soup.
Example (jssoup)
var soup = new JSSoup('<html><head><body>text<p>ptext</p></body></head></html>');
soup.get_text('|')
// 'text|ptext'
soup.get_text('|').split('|')
// ['text','ptext']
Example (beautiful soup)
from bs4 import BeautifulSoup
html = '''<html><head><body>text<p>ptext</p></body></head></html>'''
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text('|').split('|'))
Output
['text','ptext']
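For completeness, the find_all(text=True) route mentioned in the question also works on the Beautiful Soup side (recent bs4 versions spell the argument string= and keep text= as an alias):
from bs4 import BeautifulSoup

html = '<html><head><body>text<p>ptext</p></body></head></html>'
soup = BeautifulSoup(html, 'html.parser')

# Returns every NavigableString in the tree
print(soup.find_all(string=True))  # ['text', 'ptext']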

How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href?

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" methods on "i", I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
The issue is the loop for i in soup.find(...). find returns a single Tag, and iterating over a Tag walks its children, which include the whitespace NavigableStrings between elements; those have no attrs. If you're looking for only one element, there's no need to iterate at all, and if you're looking for multiple elements, find_all rather than find would probably match the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that each match is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an XPath, e.g. via lxml).
import requests
from bs4 import BeautifulSoup

url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")

# Approach 1: iterate over every matching div
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])

# Approach 2: grab the single div directly
print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])
Output:
index.php?page=2
index.php?page=2
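If you prefer CSS selectors, select_one reaches the link in a single step (same assumption as above: one div with class navigation wrapping one link):
print(soup.select_one("div.navigation a[href]")["href"])
# index.php?page=2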

Getting a specific part of a website with Beautiful Soup 4

I've got the basics down of finding things with Beautiful Soup 4. However, right now I am stuck with a specific problem. I want to scrape the "2DKT94P" from the data-oid of the code below:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer to an advanced tutorial that covers this, and/or a link to where I could have found this in the official documentation, because I failed to recognize the correct part...
Thanks in advance!
You should locate the div tag using its class attribute, then get its data-oid attribute:
div = soup.find("div", class_="js-object")
oid = div['data-oid']
If your data is well formatted, you can do it this way:
from bs4 import BeautifulSoup
example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-
oid="2DKT94P">
<div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs={"data-oid": "2DKT94P"})
print(RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
Find more info about find and find_all with attributes in the Beautiful Soup documentation.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print(RandomDIV[0].get_text().strip())
Find more about select.
EDIT:
I totally misunderstood the question. If you want to search only by the data-oid attribute, you can do it like this:
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find_all(lambda tag: [t for t in tag.attrs if t == 'data-oid'])
for div in RandomDIV:
    # data-oid
    print(div["data-oid"])
    # text
    print(div.text.strip())
Learn more about passing a function to find_all in the Beautiful Soup documentation.
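As an aside, bs4 also accepts True as an attribute value, matching any tag that merely has the attribute; a small sketch against the same example markup that avoids the lambda:
soup = BeautifulSoup(example, "html.parser")

# attrs={"data-oid": True} matches tags that have data-oid, whatever its value
for div in soup.find_all(attrs={"data-oid": True}):
    print(div["data-oid"])
    print(div.text.strip())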

Scrapy Running Results

Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re

class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for idx, td in enumerate(tds):
                if headings[idx].lower() == 'comp.':
                    content = None
                elif headings[idx].lower() == 'view':
                    content = None
                elif headings[idx].lower() == 'name':
                    content = td.xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td.xpath('span/text()').extract()[0]
                    except IndexError:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the URL in a browser with JavaScript disabled, you won't be able to move to the next page. As you can see inside the li tag, some JavaScript has to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request that the JavaScript generates. In your case it should be easy: just analyze the JavaScript code and replicate the request with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually a package that emulates JavaScript or a browser, such as ScrapyJS or Scrapy + Selenium.
You're going to need to perform a callback. Generate the URL of the "next page" button from its XPath, so url = response.xpath(<xpath to next_page_button>), and then, when you're finished scraping the current page, do yield scrapy.Request(url, callback=self.parse_next_page). Finally, create a new method, def parse_next_page(self, response):.
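A minimal sketch of that callback pattern, purely illustrative: the XPath below is hypothetical and assumes the arrow were an ordinary link with a real href, which, as noted above, it is not on this PrimeFaces page:
    def parse(self, response):
        # ... extract the rows of the current page here ...

        # Hypothetical selector; on the real page href is "#" and the
        # pagination happens inside the onclick JavaScript
        url = response.xpath('//a[contains(@class, "fa-angle-right")]/@href').get()
        if url and url != '#':
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_next_page)

    def parse_next_page(self, response):
        # Re-use the same extraction logic for subsequent pages
        return self.parse(response)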
One final note: if the content does turn out to be rendered by JavaScript (and you can't scrape it even though you're sure your XPath is correct), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape

Beautiful Soup - how to get href

I can't seem to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of HTML:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from BeautifulSoup import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''

soup = BeautifulSoup(html)
for t in soup.findAll(text=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        print s.nextSibling.nextSibling['href']
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
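A quick sketch of that more robust route in modern bs4 terms, using the id_Website id from the question's markup:
from bs4 import BeautifulSoup

html = """<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>"""

soup = BeautifulSoup(html, 'html.parser')

# Find the container by its predictable id, then the first link inside it
div = soup.find('div', id='id_Website')
print(div.find('a', href=True)['href'])  # http://google.com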