BeautifulSoup get_text() not returning all text

I'm trying to scrape from a forum site that is littered with random silly html gems, and am unable to get everything I need from it. Most of the content scrapes just fine (when the html isn't too ugly) but for the problematic posts, I'm only getting the text from the first tag. Can you help me figure out how to get all the text from the entire post? Thank you!
The HTML for the post looks something like this:
This is all contained in a <div> tag from a <td> in the post table:
<div>
  <div class="PostMessageBody">Mike,</div>
  <div> </div>
  <div>
    "lorem ipsum, lorem ipsum"
  </div>
  <div> </div>
  <div>
    "lorem ipsum, lorem ipsum"
  </div>
</div>
Most of the posts keep the whole message in a single tag, but every once in a while, this gem pops up instead. When it does, the only part that gets scraped by BS is the first div, so the string "Mike," is all I get; however, I also need to scrape the "lorem ipsum" content.
I'm using beautifulsoup and requests as follows (this is inside a loop that is reading content and writing rows to a file, so I've extracted just the relevant pieces of code here):
import requests
import re as regex
from bs4 import BeautifulSoup

url = "http://www.example.com/community/default.aspx?f=35&m=5555555&p=3"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

posts = soup.find_all('table', class_='PostBox')
if not posts:
    break  # leaves the enclosing paging loop (not shown)

for k in posts:
    tds = k.find_all('td')
    userField = tds[0].get_text()
    temp1 = regex.split(r'\n+', userField)  # hypothetical pattern; the split arguments were omitted in the original post
    userName = temp1[0]
    joinDate, postCount = temp1[-1].split("Total Posts : ")
    post = str(tds[3].find('div', class_='PostMessageBody'))
    postMessage, postSig = tds[3].find('div', class_='PostMessageBody').get_text(), ""
    postMessage = postMessage.encode('ascii', 'ignore')
    postSig = postSig.encode('ascii', 'ignore')
    # writer is the csv writer from the surrounding loop (not shown)
    writer.writerow([site, userName, joinDate, postCount, postMessage, postSig])
Thanks for the help!
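One way to pick up the sibling <div>s as well is to start from the PostMessageBody div and walk its following siblings. A minimal sketch, assuming the sibling <div>s all sit at the same level as in the snippet above:
body = tds[3].find('div', class_='PostMessageBody')
parts = [body.get_text()]
# find_next_siblings() returns the later <div>s at the same level
for sib in body.find_next_siblings('div'):
    parts.append(sib.get_text())
postMessage = " ".join(p.strip() for p in parts if p.strip())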

Related

Extracting text within div tag itself with BeautifulSoup

I am trying to extract the number, e.g. "3762", from the div below with BeautifulSoup:
<div class="contentBox">
  <div class="pid-box-1" data-pid-imprintid="3762">
  </div>
  <div class="pid-box-2" data-pid-imprintid="5096">
  </div>
  <div class="pid-box-1" data-pid-imprintid="10944">
  </div>
</div>
The div comes from this website (a pharma medical database): Drugs.com.
I cannot use "class", since it changes from div to div; there are more variants than just pid-box-1 and pid-box-2. I haven't had success using "data-pid-imprintid" either.
This is what I have tried, and I know that I can't write "data-pid-imprintid" the way I have done:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all('div', 'data-pid-imprintid')
for div in divs:
    item = div.find('div')
    id = item.get('data-pid-imprintid')
    print (id)
This gets the value of data-pid-imprintid from every div that has the attribute:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all("div", attrs={"data-pid-imprintid": True})
for div in divs:
    print(div.get('data-pid-imprintid'))
First of all, be aware there is a little typo in your html (class="pid-box-1'); without fixing it, you will only get two ids back.
How to select?
As an alternative approach to find_all() that works well, you can also go with a CSS selector:
soup.select('div[data-pid-imprintid]')
This will select every <div> with an attribute called data-pid-imprintid. To get the value of data-pid-imprintid, you have to iterate over the result set, for example with a list comprehension:
[e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
Example
from bs4 import BeautifulSoup

html = '''<div class="contentBox">
  <div class="pid-box-1" data-pid-imprintid="3762">
  </div>
  <div class="pid-box-2" data-pid-imprintid="5096">
  </div>
  <div class="pid-box-1" data-pid-imprintid="10944">
  </div>
</div>'''

soup = BeautifulSoup(html, 'lxml')
ids = [e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
print(ids)
Output
['3762', '5096', '10944']

Scrapy crawl web with many duplicated element class name

I'm new to Scrapy and trying to crawl a page, but the HTML consists of many DIVs with duplicated class names, e.g.
<section class="pi-item pi-smart-group pi-border-color">
  <section class="pi-smart-group-head">
    <h3 class="pi-smart-data-label pi-data-label pi-secondary-font pi-item-spacing">
    </h3>
  </section>
  <section class="pi-smart-group-body">
    <div class="pi-smart-data-value pi-data-value pi-font pi-item-spacing">
    </div>
  </section>
</section>
My problem is that this structure repeats for many other elements, and when I use response.css I get multiple elements that I don't want.
(Basically I want to crawl Pokemon information, e.g. the "Types", "Species" and "Ability" of each Pokemon from https://pokemon.fandom.com/wiki/Bulbasaur . I have already collected the URL for every Pokemon, but I'm stuck on getting the information from each Pokemon's page.)
I have tried to do this scrapy project for you and got the results. The issue I see is that you have used CSS. You can scrape with that, but it is far more effective to use XPath selectors; you have more versatility to select the specific tags you want. Here is the code I wrote for you. Bear in mind, this code is just something I did quickly to get your results. It works, but I did it in this way so it is easy for you to understand, since you are new to scrapy. Please let me know if this is helpful.
import scrapy

class PokemonSpiderSpider(scrapy.Spider):
    name = 'pokemon_spider'
    start_urls = ['https://pokemon.fandom.com/wiki/Bulbasaur']

    def parse(self, response):
        pokemon_type = response.xpath("(//div[@class='pi-data-value pi-font'])[1]/a/@title")
        pokemon_species = response.xpath('//div[@data-source="species"]//div/text()')
        pokemon_abilities = response.xpath('//div[@data-source="ability"]/div/a/text()')
        yield {
            'pokemon type': pokemon_type.extract(),
            'pokemon species': pokemon_species.extract(),
            'pokemon abilities': pokemon_abilities.extract()
        }
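To try it quickly, you can run the spider as a standalone script; assuming it is saved as pokemon_spider.py:
scrapy runspider pokemon_spider.py -o pokemon.json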
You can also use an XPath expression that anchors on the visible label text:
abilities = response.xpath('//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall()
species = response.xpath('//h3[a[.="Species"]]/following-sibling::div[1]/text()').get()
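For reference, these drop into a parse() callback in the same way; a sketch reusing the spider class above:
def parse(self, response):
    # anchor on the visible labels instead of the repeated class names
    abilities = response.xpath('//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall()
    species = response.xpath('//h3[a[.="Species"]]/following-sibling::div[1]/text()').get()
    yield {'species': species, 'abilities': abilities}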

Getting a specific part of a website with Beautiful Soup 4

I got the basics down of finding stuff with Beautiful Soup 4. However, right now I am stuck on a specific problem: I want to scrape the "2DKT94P" from the data-oid of the code below:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer to an advanced tutorial that covers this, and/or a link to where I could have found this in the official documentation, because I failed to recognize the correct part...
Thanks in advance!
You should locate the div tag using its class attribute, then get its data-oid attribute:
div = soup.find("div", class_="js-object")
oid = div['data-oid']
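A quick self-contained check of that approach, using the markup from the question (note that class_="js-object" matches even though the tag carries several classes, because bs4 compares class_ against each class individually):
from bs4 import BeautifulSoup

html = '''<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
  <div class="listitem relative js-listitem "></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="js-object")  # matched via one of the tag's classes
print(div['data-oid'])  # 2DKT94P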
If your data is well formatted, you can do it this way:
from bs4 import BeautifulSoup

example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
  <div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs= {"data-oid":"2DKT94P"})
print (RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
Find more info about find or find_all with attributes here.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print (RandomDIV[0].get_text().strip())
Find more about select.
EDIT:
Totally misunderstood the question. If you want to search only for data-oid, you can do it like this:
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find_all(lambda tag: [t for t in tag.attrs if t == 'data-oid'])
for div in RandomDIV:
    #data-oid
    print(div["data-oid"])
    #text
    print (div.text.strip())
Learn more here.
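A shorter equivalent of that lambda is to pass attrs={"data-oid": True} to find_all, as in the imprintid answer above:
for div in soup.find_all(attrs={"data-oid": True}):
    print(div["data-oid"])
    print(div.text.strip())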

Using Scrapy to scrape data after form submit

I'm trying to scrape content from a listing detail page that can only be viewed by clicking the 'view' button, which triggers a form submit. I am new to both Python and Scrapy.
Example markup
<li>
  <h3>Abc Widgets</h3>
  <form action="/viewlisting?id=123" method="post">
    <input type="image" src="/images/view.png" value="submit">
  </form>
</li>
My solution in Scrapy is to extract the form actions, then use Request to return the page with a callback to parse it for the desired content. However, I have hit a few issues:
I'm getting the following error: "request url must be str or unicode"
Secondly, when I hardcode a URL to overcome the above issue, it seems my parsing function is returning what looks like a list.
Here is my code, with redactions of the real URLs:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from wfi2.items import Wfi2Item

class ProfileSpider(Spider):
    name = "profiles"
    allowed_domains = ["wfi.com.au"]
    start_urls = ["http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT"
                  ]

    def parse(self, response):
        hxs = Selector(response)
        forms = hxs.xpath('//*[@id="area-managers"]//*/form')
        for form in forms:
            action = form.xpath('@action').extract()
            print "ACTION: ", action
            #request = Request(url=action, callback=self.parse_profile)
            request = Request(url=action, callback=self.parse_profile)
            yield request

    def parse_profile(self, response):
        hxs = Selector(response)
        profile = hxs.xpath('//*[@class="contentContainer"]/*/text()')
        print "PROFILE", profile
I'm getting the following error "request url must be str or unicode"
Please have a look at the scrapy documentation for extract(). It says: "Serialize and return the matched nodes as a list of unicode strings" (emphasis on "list" mine).
The first element of the list is probably what you want. So you could do something like:
request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
secondly when I hardcode a URL to overcome the above issue it seems my parsing function is returning what looks like a list
According to the documentation of xpath, it's a SelectorList. Add extract() to the xpath and you'll get a list with the text tokens. Eventually you'll want to clean up and join the elements of that list before further processing.
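Putting both fixes together, the two callbacks might look like this; a sketch that keeps the XPaths from the question:
def parse(self, response):
    for form in response.xpath('//*[@id="area-managers"]//*/form'):
        action = form.xpath('@action').extract()
        if action:
            # extract() returns a list; take the first match and make it absolute
            yield Request(url=response.urljoin(action[0]), callback=self.parse_profile)

def parse_profile(self, response):
    # extract() gives a list of unicode strings; strip and join before further processing
    tokens = response.xpath('//*[@class="contentContainer"]/*/text()').extract()
    profile = u" ".join(t.strip() for t in tokens if t.strip())
    print "PROFILE", profile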

Beautiful Soup - how to get href

I can't seem to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work:
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from BeautifulSoup import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''

soup = BeautifulSoup(html)
for t in soup.findAll(text=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        print s.nextSibling.nextSibling['href']
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
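For example, since the div in the question has id='id_Website', the more robust version might be (a sketch in the same BeautifulSoup 3 style as the answer above):
div = soup.find('div', id='id_Website')
if div is not None:
    a = div.find('a')  # first <a> inside the div
    if a is not None:
        print a['href']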