How to explain the following Beautiful Soup code? - beautifulsoup

I am new to Beautiful Soup and, while learning it, I got stuck on a piece of code. Below is the code:
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
I don't understand what ".attrs" means. How is it used and what does it do?
Secondly, when I execute this code it prints all the links but leaves out the name "href" itself. What is going on? Can someone please explain it to me?
Below is the complete code :
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

According to the Beautiful Soup 4 documentation, "attrs" holds all the attributes of an HTML tag together with their values. An "a" tag, for example, may have an "href" attribute, a "class" attribute, and so on. attrs is a dictionary, so you get a value by looking up its key, here "href". That is also why only the link is printed and not the word "href": the code prints the value stored under the key, not the key itself. For example, when it prints the following link:
"/wiki/Wikipedia:Protection_policy#semi"
then dictionary["href"] = "/wiki/Wikipedia:Protection_policy#semi"
so the value for the key "href" is "/wiki/Wikipedia:Protection_policy#semi"
To see this for yourself, add the following line inside the loop:
print(link.attrs)
and the whole dictionary is printed for each link.
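As a minimal sketch of the explanation above, the snippet below parses a small hand-written HTML string instead of the Wikipedia page. The markup and the names html and hrefs are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration (not the real Wikipedia page).
html = (
    '<a class="external" href="/wiki/Kevin_Bacon">Kevin Bacon</a>'
    '<a name="top">an anchor without href</a>'
)
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a"):
    # attrs is a dict that maps attribute names to their values
    print(link.attrs)
    if "href" in link.attrs:
        # Looking up the key returns only the value, never the key name
        hrefs.append(link.attrs["href"])

print(hrefs)
```

The second tag has no "href" key in its dictionary, which is why the `if 'href' in link.attrs` check is needed before the lookup.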

Related

Beautifulsoup Scraping Object Selection

I am new to Beautiful Soup and am trying to scrape a page with it. I need the table rows.
The tables are rendered by a React app and then shown on the website. I need suggestions on how to do this. I am struggling to create the BeautifulSoup object and do not know which class to grab to reach the table rows and their content.
webpage = urlopen(req).read()
soup = bs(webpage, "html.parser")
table=soup.find('table', {'class': 'equity'})
rows=list()
for row in table.findAll("tr"):
    rows.append(row)
I would really appreciate your help; I am having a hard time getting this done!
You can grab the td elements with this code:
webpage = urlopen(req).read()
soup = bs(webpage, "lxml")
table=soup.find('table', {'class': 'table'}).find('tr')
rows=list()
for row in table.findAll("td"):
    rows.append(row)
I preferred lxml as the parser because it has some advantages, but you can keep using html.parser.
You can also use pandas, which can read HTML tables into DataFrames. It is easy to learn from its documentation (there is a lot of it).
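The row-collection idea above can be sketched on a small inline table. The markup, the class name "equity" on this sample, and the cell values are assumptions for illustration; the real page's table may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's table and class names may differ.
html = """
<table class="equity">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.5</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "equity"})

rows = []
for tr in table.find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

If `soup.find(...)` returns None on the real page, the table is not present in the HTML that was downloaded, and the loop would need a guard before calling `find_all`.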

Python BeautifulSoup get text from class

How can I get the text "Lionel Messi" from this HTML code?
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names[0] I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code do I need to get only the text?
I want to scrape all player names from that page. But first I think I need a way to extract that text.
Unfortunately, I can't find a way to make it work.
I am new to Python and am doing some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it changes. The part that is always the same is "form rating ut20", so I thought maybe there is some kind of placeholder that lets me search for all class names including "form rating ut20".
Could you maybe help me with this as well?
To select a specific class you can use either a regular expression or, if you have bs4 version 4.7.1 or above installed, a CSS selector.
Using a regular expression returns a list of elements.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Using a CSS selector also returns a list of elements. The first selector below means "contains" and the second means "starts with".
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
Get text using the getText() method.
player_names[0].getText()
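Putting the pieces of this answer together, here is a minimal sketch on inline markup. The link tag is the one printed in the question; the surrounding div, the span elements, and their rating values are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# The <a> tag is from the question; the <span> tags are hypothetical.
html = """
<div id="repTb">
  <a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
  <span class="form rating ut20 toty gold rare">99</span>
  <span class="form rating ut20 gold rare">94</span>
  <span class="other">x</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
pool = soup.find(id="repTb")

# get_text() returns only the text inside each matched tag
names = [a.get_text() for a in pool.find_all(class_="player_name_players_table")]

# "Starts with" attribute selector: matches any class value beginning with the fixed part
ratings = [tag.get_text() for tag in pool.select('[class^="form rating ut20"]')]

print(names)
print(ratings)
```

get_text() is the same method as getText(); both names are available on a tag.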

Confused about bs4 importing methods and their effect on attributes

I know versions of this question were asked in the past, but I'm still confused and would like to settle my doubts once and for all, if possible.
If I use
from bs4 import BeautifulSoup
my soup assignment is going to be
soup = BeautifulSoup(html, "lxml")
If I do the importing thus:
from bs4 import BeautifulSoup as bs4
my soup assignment is
soup = bs4(html, "lxml")
Finally, if I import using:
import bs4
my soup assignment is
soup = bs4.BeautifulSoup(html, "lxml")
Let's use a simple html and code:
html = """
Some Document
"""
link = soup.select('a:contains(Document)')
Next, the main question:
type(link[0])
The output - in all three import cases - is:
bs4.element.Tag
But if I ask:
isinstance(link[0],bs4.element.Tag)
In the third case, I get True, but in the first two cases, I get
AttributeError: type object 'BeautifulSoup' has no attribute 'element'
Since the select() and find_all() methods frequently deliver both Tag and NavigableString results, I need to determine which is which using, for example, isinstance(). So in those cases, do I have to use the third import method? Why is there a difference in the first place?
This is a naming issue. Let's start by stating that bs4.element.Tag is the class of element instances. Think of that as the absolute location of the Tag class in bs4: bs4.element is the nested module path, and Tag (found under the element module) is the class that the elements are instances of. When displaying the class info of those elements, it will always show bs4.element.Tag.
Now, with all of that said, you can access the BeautifulSoup object in different ways. And none of this changes the fact that element tags are of type bs4.element.Tag. When you import bs4:
import bs4
bs4.BeautifulSoup()
This imports the module under the module's default name bs4. And then you can access BeautifulSoup in that module with the dot notation as BeautifulSoup is a member of that module. But locally bs4 is just a variable that references the bs4 module.
When you import as:
from bs4 import BeautifulSoup as bs4
bs4 does not mean the same thing as the first example. In the first example we imported the entire module under its default name (bs4), but here we instead import the BeautifulSoup class and rename it locally as bs4. Regardless of what we call it locally, it is still a class at bs4.BeautifulSoup, where bs4 is the module name. Locally though (local to this file), we created a variable reference to the BeautifulSoup class with a name that happens to be the same as the module.
So, when you use select to return elements, they are of the type bs4.element.Tag. This is true regardless of what your local variables happen to be named. This is internally how they are known.
So, when checking instances, the variable name is not what matters; what matters is what the variable references. In the third example, import bs4 causes bs4 to reference the bs4 module; therefore, Tag can be accessed at bs4.element.Tag. But when you use from bs4 import BeautifulSoup as bs4, bs4 no longer references the bs4 module. It references the BeautifulSoup class, which is a class rather than a module and has no attribute called element.
The local name is just how your current file is referencing the object it refers to.
So in your failing cases, you would need to import the Tag class under a name that you can pass to isinstance:
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> soup = BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>', 'html.parser')
>>> isinstance(soup.find('div'), Tag)
True
Tag here is just a name, but it references bs4.element.Tag, so it works.
We could call it anything and it will still work as long as it references the correct object:
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag as Apple
>>> soup = BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>', 'html.parser')
>>> isinstance(soup.find('div'), Apple)
True
Hopefully that makes more sense :).
EDIT: Just a tip, but bs4 makes some references to things like NavigableString and Tag available in the top level module, so you don't have to reach all the way down to bs4.element to get a proper reference, you can simply do:
from bs4 import Tag, NavigableString
Again, this alternative reference of bs4.Tag is just a variable named Tag in the bs4 module that refers to the actual bs4.element.Tag class. You can use that, and it will still refer to the same class. It is just used locally in the bs4 module to reference the Tag class in element.
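As a small sketch of the tip above, the snippet below imports Tag and NavigableString from the top-level module and uses isinstance() to tell a div's children apart. The sample markup and the names kinds and same_class are assumptions for illustration.

```python
import bs4
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<div>Test<span>test</span></div>", "html.parser")
div = soup.find("div")

kinds = []
for child in div.children:
    # Check Tag first, then fall back to NavigableString for plain text nodes
    if isinstance(child, Tag):
        kinds.append("Tag")
    elif isinstance(child, NavigableString):
        kinds.append("NavigableString")

# The top-level name and the fully qualified name refer to one and the same class
same_class = Tag is bs4.element.Tag

print(kinds, same_class)
```

Because the check uses the class object itself, it works no matter which of the three import styles was used to create the soup.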

Why does this Python 2.7 code produce no output?

This is an example from a python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, and overlooked that the HTML I wanted might not exist on the page. Maybe this is why I get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's URL to the new page, I'd suggest taking some time to familiarize yourself with the page source. For instance, it uses <h2> instead of <h3> tags for its links.
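The effect described above can be reproduced offline with a short sketch. Note that it is written for Python 3 with bs4 rather than the Python 2 BeautifulSoup 3 setup from the book, and that the markup, the job titles, and the helper collect_jobs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the links live under <h2>, and there is no <h3> at all.
html = """
<h2><a class="reference" href="/jobs/1/">Python Developer</a></h2>
<h2><a class="reference" href="/jobs/2/">Data Engineer</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")


def collect_jobs(soup, header_tag):
    """Collect 'title (href)' strings from links under the given header tag."""
    jobs = set()
    for header in soup(header_tag):        # soup('h3') is shorthand for soup.find_all('h3')
        links = header("a", "reference")   # <a> tags whose class is "reference"
        if not links:
            continue
        link = links[0]
        jobs.add("%s (%s)" % (link.string, link["href"]))
    return jobs


from_h3 = collect_jobs(soup, "h3")  # no <h3> tags, so the loop body never runs
from_h2 = collect_jobs(soup, "h2")

print(from_h3)
print("\n".join(sorted(from_h2, key=lambda s: s.lower())))
```

With no matching headers the set stays empty, so joining it prints only an empty line, which looks like no output at all.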

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is get a part of a huge XML file called "quests", based on the quest tag and its "id" attribute. Then it is supposed to make a new soup from that part and get all the location_id tags from within the quest. For now, before I figure out which quest ids I want to choose (and how I will handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
There is no need to build a new soup object from a tag object. quest is a Tag object, not a string, and you can call find_all on both a soup and a tag: the BeautifulSoup class is a subclass of Tag, so they behave in the same way and can be accessed in the same way.
In effect, the soup object is a special tag object whose name is [document]:
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
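A minimal sketch of the fix above, run on a small inline document instead of the real quests.xml. The sample XML and its values are assumptions, and html.parser is used here only so that the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for Data/xmls/quests.xml
xml = """
<quests>
  <quest id="1478"><location_id>12</location_id><location_id>15</location_id></quest>
  <quest id="2000"><location_id>99</location_id></quest>
</quests>
"""
soup = BeautifulSoup(xml, "html.parser")

floors = []
for quest in soup.find_all("quest", {"id": "1478"}):
    # quest is already a Tag, so find_all can be called on it directly
    for floor in quest.find_all("location_id"):
        floors.append(floor.get_text())

# The soup itself is also a Tag, which is why both support the same methods
soup_is_tag = isinstance(soup, Tag)

print(floors, soup_is_tag, soup.name)
```

Only the location_id values inside the selected quest are collected; the second quest is skipped by the id filter.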