Getting a specific part of a website with Beautiful Soup 4 - beautifulsoup

I have the basics of finding things with Beautiful Soup 4 down. However, right now I am stuck on a specific problem. I want to scrape the "2DKT94P" from the data-oid attribute in the code below:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer for an advanced tutorial that covers this, and/or a link on where I would have been able to find this in the official documentation because I failed to recognize the correct part...
Thanks in advance!

You should locate the div tag using its class attribute, then get its data-oid attribute:
div = soup.find("div", class_="js-object")
oid = div['data-oid']
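As a minimal, self-contained sketch of the same idea (the HTML is copied from the question, with a closing tag added so it parses cleanly; the None checks are my own addition, not part of the answer above):

```python
from bs4 import BeautifulSoup

# Snippet from the question, closed off so it is well formed
html = '''<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem "></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the given value
div = soup.find("div", class_="js-object")

# .get() returns None instead of raising KeyError when the attribute is absent
oid = div.get("data-oid") if div is not None else None
print(oid)
```

Using .get() rather than div['data-oid'] is just a defensive choice for pages where some list items might lack the attribute.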

If your data is well formatted, you can do it this way:
from bs4 import BeautifulSoup
example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs= {"data-oid":"2DKT94P"})
print (RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
For more on using find or find_all with attributes, see the Beautiful Soup documentation.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print (RandomDIV[0].get_text().strip())
For more on select, see the CSS selectors section of the Beautiful Soup documentation.
EDIT:
I totally misunderstood the question. If you want to find every tag that has a data-oid attribute, whatever its value, you can do it like this:
soup = BeautifulSoup(example, "html.parser")
# Match any tag that carries a data-oid attribute
tags_with_oid = soup.find_all(lambda tag: tag.has_attr("data-oid"))
for div in tags_with_oid:
    # data-oid
    print(div["data-oid"])
    # text
    print(div.text.strip())
See the Beautiful Soup documentation on passing a function to find_all for more details.

Related

I want to scrape the link from the html code below, but do not know how to do that because it is in the brackets

This is the HTML from which I want to scrape the link to the YouTube video, but I do not know how to do it. If anyone knows how, please answer.
<button id='btnWatchLikeAndSubscribe' class='greenButton button' style='font-size: 18px;'
onclick="newtab =openWin('http://www.youtube.com/watch?v=lZenDvvS5WM');
enableWatchTimer();">1.
<i class='fa fa-eye'></i>&nbsp Watch, Like & Subscribe</button>
You could use a regular expression:
import re
# TEXT_OF_BUTTON is the button's HTML (or its onclick attribute) as a string
match = re.search(r'''openWin\(('(?P<url>[^']*)')\)''', TEXT_OF_BUTTON)
url = match.groupdict().get('url') if match else None
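If you are already parsing the page with Beautiful Soup, one possible way to combine the two is to pull the onclick attribute out first and run the regular expression only on that. This is a sketch under the assumption that the button keeps the id shown in the question; the simplified pattern and the variable names are mine:

```python
import re
from bs4 import BeautifulSoup

# Button markup from the question
html = """<button id='btnWatchLikeAndSubscribe' class='greenButton button' style='font-size: 18px;'
onclick="newtab =openWin('http://www.youtube.com/watch?v=lZenDvvS5WM');
enableWatchTimer();">1.
<i class='fa fa-eye'></i>&nbsp; Watch, Like &amp; Subscribe</button>"""

soup = BeautifulSoup(html, "html.parser")
button = soup.find("button", id="btnWatchLikeAndSubscribe")

# Fall back to an empty string so re.search always receives text
onclick = button.get("onclick", "") if button else ""

# Capture whatever sits between the single quotes inside openWin('...')
match = re.search(r"openWin\('(?P<url>[^']*)'\)", onclick)
url = match.group("url") if match else None
print(url)
```

Restricting the search to the attribute value avoids accidental matches elsewhere in the page.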

Pandas web scraping(Beautiful soup) find in tag with class, another tag with a link. Then following the link inside href

I tried to find the 'td' tag with a specific attribute, and then find the 'a' tag inside that 'td' tag:
for row in bs4.find_all('<td class="series-column"'):
    for link in bs4.find_all('a'):
        if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
            print(link.attrs['href'])
The screenshot shows the HTML for this page.
Your bs4.find_all('<td class="series-column"') is wrong. You have to supply the tag name and the attributes you want to find, for example bs4.find_all('td', class_='series-column'). Or use a CSS selector:
from bs4 import BeautifulSoup
txt = '''
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
    print(link['href'])
Prints:
//knoema.com/...
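For comparison, here is a sketch of the find_all route that stays close to the loop in the question. Two further things differ from the original attempt: the inner search runs on each cell rather than on the whole document, and the class is passed to find_all instead of comparing the result of has_attr (a boolean) with a string. The second td and the hrefs list are made up for illustration:

```python
from bs4 import BeautifulSoup

txt = '''
<table><tr>
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>
<td class="other-column"><a href="/ignored">ignored</a></td>
</tr></table>'''

soup = BeautifulSoup(txt, 'html.parser')

hrefs = []
# Outer loop: only the cells with the wanted class
for cell in soup.find_all('td', class_='series-column'):
    # Inner loop: search inside this cell, keeping links that have an href
    for link in cell.find_all('a', class_='result-url', href=True):
        hrefs.append(link['href'])

print(hrefs)
```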

Scrapy crawl web with many duplicated element class name

I'm new to Scrapy and am trying to crawl a site, but the HTML consists of many elements that share the same class names, e.g.:
<section class= "pi-item pi-smart-group pi-border-color">
<section class="pi-smart-group-head">
<h3 class = "pi-smart-data-label pi-data-label pi-secondary-font pi-item-spacing">
</section>
<section class= "pi-smart-group-body">
<div class="pi-smart-data-value pi-data-value pi-font pi-item-spacing">
</div>
</section>
</section>
My problem is that this structure repeats for many other elements, so when I use response.css I get multiple elements that I don't want.
(Basically, I want to crawl Pokemon information such as "Types", "Species" and "Ability" for each Pokemon from https://pokemon.fandom.com/wiki/Bulbasaur. I already have the URLs for all Pokemon, but I am stuck on extracting the information for each one.)
I tried this Scrapy project myself and got the results. The issue I see is that you used CSS selectors. You can scrape with those, but XPath selectors are often more effective here because they give you more flexibility in selecting the specific tags you want. Here is the code I wrote. Bear in mind that this is something I put together quickly to get your results. It works, and I wrote it this way so that it is easy to understand, since you are new to Scrapy. Please let me know if this is helpful.
import scrapy

class PokemonSpiderSpider(scrapy.Spider):
    name = 'pokemon_spider'
    start_urls = ['https://pokemon.fandom.com/wiki/Bulbasaur']

    def parse(self, response):
        # In XPath, attributes are addressed with "@"
        pokemon_type = response.xpath("(//div[@class='pi-data-value pi-font'])[1]/a/@title")
        pokemon_species = response.xpath('//div[@data-source="species"]//div/text()')
        pokemon_abilities = response.xpath('//div[@data-source="ability"]/div/a/text()')
        yield {
            'pokemon type': pokemon_type.extract(),
            'pokemon species': pokemon_species.extract(),
            'pokemon abilities': pokemon_abilities.extract()
        }
You can also use an XPath expression that matches on the label text:
abilities = response.xpath('//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall()
species = response.xpath('//h3[a[.="Species"]]/following-sibling::div[1]/text()').get()

Getting inner tag text whilst using a filter in BeautifulSoup

I have:
... html
<div id="price">$199.00</div>
... html
How do I get the $199.00 text? Using
soup.findAll("div",id="price",text=True)
does not work, as I get all the inner text from the whole document.
Find the div tag, and use its text attribute to get the text inside the tag.
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <html>
... <body>
... <div id="price">$199.00</div>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('div', id='price').text
'$199.00'
You are very close to making it work.
(1) How to search for and locate the tag you are interested in:
Let's take a look at how to use the find_all function:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):...
name="div": the tag name to match.
attrs={"id":"price"}: a dictionary of attribute names and the values they must have.
recursive: whether to search all descendants (True) or only direct children (False).
text: matches on text content, and can be combined with regular expressions to find strings containing certain text. Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.
limit: the maximum number of results to return. find behaves like find_all with limit=1, except that it returns the single match (or None) rather than a list.
In your case, here is a list of commands that locate the tags using different arguments:
>>> # in case you have multiple interesting divs, I am using find_all here
>>> html = '''<html><body><div id="price">$199.00</div><div id="price">$205.00</div></body></html>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.find_all(attrs={"id":"price"}))
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
>>> # This is a bit funky, but sometimes searching by text is extremely helpful
>>> # because text is what a human actually sees, so it can be more reliable
>>> import re
>>> tags = [text.parent for text in soup.find_all(text=re.compile(r'\$'))]
>>> print(tags)
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
There are many different ways to locate your elements; you just need to ask yourself which one will be the most reliable way to locate an element.
For more information, see the find and find_all sections of the Beautiful Soup 4 documentation.
(2) How to get the text of a tag:
tag.text returns a Unicode string (str in Python 3); if you need bytes, use tag.text.encode('utf-8').
tag.string will also work when the tag contains a single string, as it does here.
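To make the difference between the two concrete, here is a small sketch. The second element (the p with id "note") is made up for illustration and is not part of the question:

```python
from bs4 import BeautifulSoup

# The <p id="note"> element is a hypothetical addition for contrast
html = '<div id="price">$199.00</div><p id="note">Was <b>$250.00</b> before</p>'
soup = BeautifulSoup(html, "html.parser")

price = soup.find("div", id="price")
note = soup.find("p", id="note")

# A tag with exactly one string child: .text and .string agree
price_text = price.text
price_string = price.string

# A tag with mixed children: .text joins all strings, .string is None
note_text = note.get_text()
note_string = note.string

print(price_text, price_string, note_text, note_string)
```

So .text (or get_text()) is usually the safer choice when you cannot be sure the tag holds nothing but a single string.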

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from bs4 import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.find_all(string=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        # The first sibling is the whitespace string; the second is the <a> tag
        print(s.next_sibling.next_sibling['href'])
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
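A minimal sketch of that more robust route, assuming the div really does keep the id id_Website from the question (the None checks and variable names are my own):

```python
from bs4 import BeautifulSoup

html = """<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container by its id, then take the first link inside it
container = soup.find("div", id="id_Website")
link = container.find("a", href=True) if container else None
href = link["href"] if link else None
print(href)
```

This does not depend on the whitespace between the tags, so it keeps working if the markup is reformatted.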