Why does this Python 2.7 code produce no output? - beautifulsoup

This is an example from a Python book. When I run it, I don't get any output. Can someone help me? Thanks!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only suspected that the URL was wrong, and I overlooked that the HTML I wanted to extract did not exist on the page. Maybe this is why I get empty output.

If you open the page and inspect the HTML, you'll notice there are no <h3> tags containing links. This is why you have no output.
So links is always empty, and if not links: continue skips every header.
This is probably because the page has moved to https://www.python.org/jobs/, so the <h3> tags containing links are no longer present.
If you point this code's URL at the new page, I'd suggest taking some time to familiarize yourself with the page source first. For instance, it uses <h2> instead of <h3> tags for its links.
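One way to check which heading level actually carries the links, before writing the loop, is sketched below. The HTML string is made up for illustration and is not the real markup of the jobs page. The sketch also assumes bs4 on Python 3 rather than the BeautifulSoup 3 / Python 2.7 setup from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a jobs page; the real page will differ.
SAMPLE_HTML = """
<div>
  <h2><a href="/jobs/1/">Backend Developer</a></h2>
  <h2><a href="/jobs/2/">data engineer</a></h2>
  <h2>Heading without a link</h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Diagnose first: count how many headings of each level contain a link.
link_counts = {
    name: sum(1 for header in soup.find_all(name) if header.find("a"))
    for name in ("h2", "h3")
}
print(link_counts)

# Then collect the links from the heading level that actually has them.
jobs = set()
for header in soup.find_all("h2"):
    link = header.find("a")
    if link is None:
        continue  # skip headings that carry no link
    jobs.add("%s (%s)" % (link.get_text(strip=True), link["href"]))

print("\n".join(sorted(jobs, key=str.lower)))
```

If the count for the tag you loop over is zero, the loop body never adds anything, which matches the empty output described in the question.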

Related

Python BeautifulSoup get text from class

How can I get the text "Lionel Messi" from this HTML code?
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names[0] I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code would I have to add to get only its text?
I want to scrape all player names from that page in my code. But first I think I need to find a way to extract that text.
Unfortunately, I can't find a way to make it work.
I am new to Python and am trying to do some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it changes. The part that always stays the same is "form rating ut20", so I thought maybe there is some kind of placeholder that lets me search for all "class" names including "form rating ut20".
Could you maybe help me with this as well?
To select a specific class you can use either a regular expression or, if you have bs4 version 4.7.1 or above installed, a CSS selector.
Using a regular expression returns a list of elements.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Using a CSS selector also returns a list of elements. The first selector below means "contains" and the second one means "starts with".
prating = pool.select('[class*="form rating ut20"]')
or
prating = pool.select('[class^="form rating ut20"]')
Get the text using the getText() method.
player_names[0].getText()
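The approaches above can be tried on a small, self-contained sample. This is only a sketch: the table markup, the second player, and the rating values are invented for illustration, and only the class names come from the question. It also shows a plain class selector (.form.rating.ut20), which is an alternative not mentioned above that does not depend on the order of the classes in the attribute.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names quoted in the question.
SAMPLE_HTML = """
<table id="repTb">
  <tr>
    <td><a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a></td>
    <td><span class="form rating ut20 toty gold rare">99</span></td>
  </tr>
  <tr>
    <td><a class="player_name_players_table" href="/20/player/1/example-player">Example Player</a></td>
    <td><span class="form rating ut20 gold non-rare">85</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pool = soup.find(id="repTb")

# get_text() (alias: getText()) returns only the text inside each tag.
names = [a.get_text(strip=True)
         for a in pool.find_all(class_="player_name_players_table")]

# Regular expression matched against the class attribute.
by_regex = pool.find_all(class_=re.compile("form rating ut20"))

# CSS attribute selector: the class attribute starts with the fixed prefix.
by_prefix = pool.select('[class^="form rating ut20"]')

# CSS class selector: all three classes present, in any order.
by_classes = pool.select(".form.rating.ut20")

ratings = [tag.get_text() for tag in by_classes]
print(names, ratings)
```

The attribute selectors compare the raw attribute string, so they stop matching if the site reorders the classes. The class selector does not have that limitation.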

Unable to parse the href tag in Python

I get the following output in my beautiful soup.
[Search over 301,944 datasets\n]
I need to extract only the number 301,944 from this. Please guide me on how this can be done. My code so far:
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.data.gov/').text
soup = BeautifulSoup(source, 'lxml')
#print soup.prettify()
images = soup.find_all('small')
print images
con = images.find_all('a')  # I am unable to get the anchor tag here. It says the anchor tag is not present
print con
#for con in images.find_all('a',href=True):
#print con
#content = images.split('metrics')
#print content[1]
#images = soup.find_all('a', {'href':re.compile('\d+')})
#print images
There is only one <small> tag on the website.
Your images variable contains it, but you are retrieving the anchor tag in the wrong way: find_all returns a list of elements, and that list has no find_all method of its own.
If you want to retrieve the text from the tag, you can get it with:
soup.find('small').a.text
Here the find method returns the first small element it encounters on the page. If you use find_all, you get a list of all small elements (but there's only one small tag here).
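To go from that text to just the number, a regular expression is one option. The sketch below runs on a hard-coded snippet instead of the live site. The href value is an assumption, and only the link text comes from the output shown in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the href is assumed, the text is from the question.
SAMPLE_HTML = '<small><a href="/metrics">Search over 301,944 datasets\n</a></small>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet, so index into it before using .a
smalls = soup.find_all("small")
link_text = smalls[0].a.get_text()  # same text as soup.find("small").a.text

# Grab the first run of digits and commas, then drop the commas for an int.
match = re.search(r"\d[\d,]*", link_text)
count_text = match.group(0) if match else None
count = int(count_text.replace(",", "")) if count_text else None
print(count_text, count)
```

Checking match before calling group avoids an AttributeError if the page text ever stops containing a number.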

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page adds further <ul class='ships-listing'> elements, and these are generated by JavaScript.
So when you download the page using urllib, the downloaded content only contains the original source page (you can see it with the view-source option in the browser), and findAll correctly finds just one match.
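The effect can be reproduced without the network. In this sketch the page source is invented: it contains one list in the static HTML plus a script that would add another list in a browser. An HTML parser never runs the script, so only the static list is found.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: one static list, plus a script that would
# create more lists only when executed by a browser.
STATIC_HTML = """
<html><body>
  <ul class="ships-listing"><li>Ship A</li></ul>
  <script>
    var ul = document.createElement('ul');
    ul.className = 'ships-listing';
    document.body.appendChild(ul);
  </script>
</body></html>
"""

page_soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The parser treats the script body as plain text and does not execute it.
containers = page_soup.find_all("ul", {"class": "ships-listing"})
print(len(containers))
```

Changing the parser does not help here, because none of them execute JavaScript. Getting the remaining elements requires something that renders the page, or the data source the script itself loads from.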

How to explain the following Beautiful Soup code?

I am new to Beautiful Soup and trying to learn it. While learning, I got stuck on a certain piece of code. Here it is:
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
I am unable to understand the meaning of ".attrs": how is it used, and what does it do?
Secondly, when I execute this code it prints all the links but leaves out the name "href" itself. What is going on? Can someone please explain it to me?
Below is the complete code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
According to the beautifulsoup4 documentation, attrs holds all the attributes of an HTML tag together with their values. An "a" tag, for example, may have an "href" attribute, a "class" attribute, and so on. attrs is a dictionary, so you get a value by looking up its key, here "href". The code prints only the value stored under that key, which is why the word "href" itself does not appear in the output. For example, when it prints the following link:
"/wiki/Wikipedia:Protection_policy#semi"
then dictionary["href"] = "/wiki/Wikipedia:Protection_policy#semi",
so the value for the key "href" is "/wiki/Wikipedia:Protection_policy#semi".
To see this for yourself, just add the following line:
print(link.attrs)
and it should all become clear.
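A small offline sketch of what attrs looks like. The two anchor tags are invented for illustration, and only the href value is taken from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with several attributes, one without href.
SAMPLE_HTML = """
<a href="/wiki/Wikipedia:Protection_policy#semi" class="mw-link" title="Protected">lock</a>
<a name="top">anchor without href</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.find_all("a")

# attrs is a plain dict mapping attribute names to their values.
first = links[0]
print(first.attrs)

# Same filtering as in the question: keep only tags that have an href key.
hrefs = [link.attrs["href"] for link in links if "href" in link.attrs]

# tag.get("href") is a shortcut that returns None when the key is missing.
hrefs_via_get = [link.get("href") for link in links]
print(hrefs, hrefs_via_get)
```

Note that class is stored as a list, because an element can have several classes. Most other attributes are stored as plain strings.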

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is get a part of a huge XML file called "quests", based on the <quest> tag and its "id" attribute. Then it is supposed to make a new soup from that part and get all the <location_id> tags from within the <quest>. For now, before I figure out which quest ids I want to choose (and how I will handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
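There is no need to build a new soup object from a tag object. You can call find_all on either one: a BeautifulSoup object and a Tag object behave the same way for searching and can be accessed in the same way. The error most likely comes from passing a Tag to the BeautifulSoup constructor, which expects a string or a file-like object.
In my opinion, the soup object is just a special tag object named "[document]":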
import requests, bs4
r = requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
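The same idea on a tiny stand-in for quests.xml. The XML content below is invented, since the real file is not shown. The sketch also uses 'html.parser' instead of 'lxml' only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for Data/xmls/quests.xml; the real file is not shown.
SAMPLE_XML = """
<quests>
  <quest id="1478">
    <location_id>12</location_id>
    <location_id>15</location_id>
  </quest>
  <quest id="2000">
    <location_id>99</location_id>
  </quest>
</quests>
"""

soup = BeautifulSoup(SAMPLE_XML, "html.parser")
questid = "1478"

floors = []
for quest in soup.find_all("quest", {"id": questid}):
    # quest is a Tag, so it can be searched directly; no second soup needed.
    floors.extend(floor.get_text() for floor in quest.find_all("location_id"))

print(floors)

# BeautifulSoup is itself a subclass of Tag, named "[document]".
print(isinstance(quest, Tag), issubclass(BeautifulSoup, Tag), soup.name)
```

If a separate soup is really needed, converting the tag to a string first, as in BeautifulSoup(str(quest), "html.parser"), gives the constructor the kind of input it expects.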