So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or the requests HTTP library), it downloads the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and these are created by JavaScript.
So when you download a page using urllib, the downloaded content only contains the original source page (you can see it with the view-source option in the browser).
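A minimal sketch of the point above, assuming a made-up stand-in for the initial page source (INITIAL_SOURCE and count_listings are hypothetical names, and 'html.parser' is used only so the sketch needs no extra parser). A parser sees only the markup the server sent, so lists added later by JavaScript are never counted:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML the server sends before any JavaScript runs.
INITIAL_SOURCE = """
<html><body>
  <ul class="ships-listing"><li>Ship A</li></ul>
  <script>/* the real page appends more lists while scrolling */</script>
</body></html>
"""


def count_listings(html):
    """Count <ul class="ships-listing"> elements present in the given markup."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("ul", {"class": "ships-listing"}))


# Only the list that is physically in the source is found.
print(count_listings(INITIAL_SOURCE))
```

Switching the parser does not change this count, since the missing elements are simply not in the downloaded text.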
import requests
from bs4 import BeautifulSoup as soup
my_url='http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty'
page=requests.get(my_url)
data=page.text
page_soup=soup(data,'html.parser')
cont=page_soup.select("div",{"class": "item-page"})
print(cont)
I am trying to scrape the faculty details (name, designation, profile) into a CSV file.
When I use the above code, it returns an empty list [].
Any help is greatly appreciated.
The page is looking for any of a defined set of valid user agents. For example,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty', headers = {'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
print(soup.select('.item-page'))
Without that, you get a 406 response, and the classes you are looking for are not present in the HTML.
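One way to catch this kind of failure early is to check the status code before parsing. The following is only a sketch: parse_if_ok and fetch_items are hypothetical helper names, the timeout value is an assumption, and the User-Agent string is the one from the answer above.

```python
import requests
from bs4 import BeautifulSoup

# User-Agent string taken from the answer above; other values may also work.
HEADERS = {"User-Agent": "Chrome/80.0.3987.163"}


def parse_if_ok(status_code, html, selector):
    """Return elements matching the CSS selector, but only for a 200 response."""
    if status_code != 200:
        raise RuntimeError("unexpected HTTP status %d" % status_code)
    return BeautifulSoup(html, "html.parser").select(selector)


def fetch_items(url, selector=".item-page"):
    """Download the page with the custom header, then parse it (hypothetical helper)."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    return parse_if_ok(response.status_code, response.text, selector)
```

With this in place, a 406 response raises an error instead of silently producing an empty list.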
I am a new user to Beautiful Soup and am trying to create a baby application that retrieves the view count from a YouTube url.
So, I looked at the BS docs and I saw that you could retrieve items by their id. So I attempted to retrieve the info id - but whenever I attempt to do this, it comes out as "None", so it must not be finding the id.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divisions = soup.findAll("div")
print(divisions[0])
info = soup.find(id="info")
print(info)
You can try searching for the meta itemprop="interactionCount" tag, but its value is often not exact. The best way is to use the official YouTube API. The meta tag approach looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])
Prints:
165011
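Since find() and select_one() return None when nothing matches, it may help to guard the lookup. A small sketch, assuming the content attribute holds a plain integer (view_count and SAMPLE are hypothetical, and the sample markup is made up rather than taken from YouTube):

```python
from bs4 import BeautifulSoup


def view_count(html):
    """Return the interactionCount value as an int, or None if the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('meta[itemprop="interactionCount"][content]')
    return int(tag["content"]) if tag is not None else None


# Made-up sample markup for illustration only.
SAMPLE = '<div><meta itemprop="interactionCount" content="165011"></div>'

print(view_count(SAMPLE))
print(view_count('<div id="other"></div>'))
```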
I know versions of this question were asked in the past, but I'm still confused and would like to settle my doubts once and for all, if possible.
If I use
from bs4 import BeautifulSoup
my soup assignment is going to be
soup = BeautifulSoup(html, "lxml")
If I do the importing thus:
from bs4 import BeautifulSoup as bs4
my soup assignment is
soup = bs4(html, "lxml")
Finally, if I import using:
import bs4
my soup assignment is
soup = bs4.BeautifulSoup(html, "lxml")
Let's use a simple html and code:
html = """
Some Document
"""
link = soup.select('a:contains(Document)')
Next, the main question:
type(link[0])
The output - in all three import cases - is:
bs4.element.Tag
But if I ask:
isinstance(link[0],bs4.element.Tag)
In the third case, I get True, but in the first two cases, I get
AttributeError: type object 'BeautifulSoup' has no attribute 'element'
Since the select() and find_all() methods frequently deliver either Tag or NavigableString results, I need to determine which is which using, for example, isinstance(). So in those cases, do I have to use the third import method? Why is there a difference in the first place?
This is a naming game you are playing. Let's start by stating that bs4.element.Tag is the class of element instances. Think of that as the absolute location of the Tag class in bs4: bs4.element is the nested module, and Tag (found under the element module) is the class that the elements are instances of. When displaying the class info of those elements, it will always show bs4.element.Tag.
Now, with all of that said, you can access the BeautifulSoup object in different ways. And none of this changes the fact that element tags are of type bs4.element.Tag. When you import bs4:
import bs4
bs4.BeautifulSoup()
This imports the module under the module's default name bs4. And then you can access BeautifulSoup in that module with the dot notation as BeautifulSoup is a member of that module. But locally bs4 is just a variable that references the bs4 module.
When you import as:
from bs4 import BeautifulSoup as bs4
bs4 does not mean the same thing as the first example. In the first example we imported the entire module under its default name (bs4), but here we instead import the BeautifulSoup class and rename it locally as bs4. Regardless of what we call it locally, it is still a class at bs4.BeautifulSoup, where bs4 is the module name. Locally though (local to this file), we created a variable reference to the BeautifulSoup class with a name that happens to be the same as the module.
So, when you use select to return elements, they are of the type bs4.element.Tag. This is true regardless of what your local variables happen to be named. This is internally how they are known.
So, when checking instances, the variable name is not important; what matters is what the variable references. In the third example, import bs4 causes bs4 to reference the bs4 module; therefore, Tag can be accessed at bs4.element.Tag. But when you use from bs4 import BeautifulSoup as bs4, bs4 no longer references the bs4 module. It references the BeautifulSoup class, which is a class rather than a module and has no attribute called element through which Tag could be reached.
The local name is just how your current file is referencing the object it refers to.
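The difference between the two names can be checked directly. This is just an illustrative sketch; bs4_module and bs4_alias are hypothetical local names chosen so that both imports can coexist in one file:

```python
import inspect

import bs4 as bs4_module                      # the module itself
from bs4 import BeautifulSoup as bs4_alias    # the class, under a local alias

# One name refers to a module, the other to a class.
print(inspect.ismodule(bs4_module))
print(inspect.isclass(bs4_alias))

# The alias is the very same object as the class inside the module.
print(bs4_alias is bs4_module.BeautifulSoup)
```

All three lines print True, which is why only the module name can be followed by .element.Tag.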
So in your failing cases, you would need to import the Tag class under a name you can pass to isinstance():
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> soup = BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>', 'html.parser')
>>> isinstance(soup.find('div'), Tag)
True
Tag here is just a name, but it references bs4.element.Tag, so it works.
We could call it anything and it will still work as long as it references the correct object:
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag as Apple
>>> soup = BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>', 'html.parser')
>>> isinstance(soup.find('div'), Apple)
True
Hopefully that makes more sense :).
EDIT: Just a tip, but bs4 makes some references to things like NavigableString and Tag available in the top level module, so you don't have to reach all the way down to bs4.element to get a proper reference, you can simply do:
from bs4 import Tag, NavigableString
Again, this alternative reference of bs4.Tag is just a variable named Tag in the bs4 module that refers to the actual bs4.element.Tag class. You can use that, and it will still refer to the same class. It is just used locally in the bs4 module to reference the Tag class in element.
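To tie this back to the original goal of telling Tag and NavigableString results apart, here is a short sketch using the top-level names (the kind helper and the sample markup are made up for illustration):

```python
import bs4
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<div>Test<span>test</span></div>", "html.parser")
div = soup.find("div")


def kind(node):
    """Classify a child node as a tag or a string (hypothetical helper)."""
    if isinstance(node, Tag):
        return "tag"
    if isinstance(node, NavigableString):
        return "string"
    return "other"


# The div has a text child followed by a <span> child.
kinds = [kind(child) for child in div.children]
print(kinds)

# The top-level name and the fully qualified name refer to the same class.
print(Tag is bs4.element.Tag)
```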
This is an example from a python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Re-edit:
At first I only considered that the URL might be wrong, and I overlooked that the HTML information I wanted did not exist. Maybe this is why I get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's URL to the new page, I'd suggest taking some time to familiarize yourself with the page source. For instance, it uses <h2> instead of <h3> tags for its links.
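A rough Python 3 / bs4 version of the book's loop, with the header tag as a parameter so it can be adapted once you have inspected the page. The sample markup below is made up, and the real page's structure may differ; collect_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def collect_links(html, header_tag="h2"):
    """Collect 'text (href)' strings for links found inside the given header tag."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = set()
    for header in soup.find_all(header_tag):
        link = header.find("a", href=True)
        if link is None:
            continue  # header without a link, skip it
        jobs.add("%s (%s)" % (link.get_text(strip=True), link["href"]))
    return sorted(jobs, key=str.lower)


# Made-up sample markup for illustration only.
SAMPLE = """
<h2><a href="/jobs/1/">Backend Developer</a></h2>
<h2>No link here</h2>
<h2><a href="/jobs/2/">analyst</a></h2>
"""

print("\n".join(collect_links(SAMPLE)))
```

Calling collect_links(SAMPLE, "h3") returns an empty list, which mirrors the empty output described in the question.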
I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is to get a part of a huge xml called "quests", based on the tag <quest> - and its attribute - "id". Then it is supposed to make a new soup from that part and get all the <location_id> tags from within the <quest>. For now, before I figure out which quest ids I want to choose (and how I will handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
There is no need to build a new soup object from a tag object. You can call find_all on both of them: BeautifulSoup is a subclass of Tag, so they behave in the same way and can be searched in the same way.
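Calling find_all directly on each matched tag can be sketched on a small made-up document. The XML content here is invented, and 'html.parser' is used only to keep the sketch free of extra dependencies (the question itself uses lxml):

```python
from bs4 import BeautifulSoup

# Made-up miniature of the quests file.
XML = """
<quests>
  <quest id="1478"><location_id>12</location_id><location_id>13</location_id></quest>
  <quest id="2000"><location_id>99</location_id></quest>
</quests>
"""

soup = BeautifulSoup(XML, "html.parser")

# Search inside each matching <quest> tag without building a second soup.
floors = [
    floor.get_text()
    for quest in soup.find_all("quest", {"id": "1478"})
    for floor in quest.find_all("location_id")
]
print(floors)
```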
In my opinion, the soup object is a special tag object whose name is '[document]':
import requests, bs4
r = requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
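The same check can be done without a network request. A short sketch on a made-up one-line document:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# The soup object is itself an instance of Tag.
print(isinstance(soup, Tag))

# Its name is the special root name, while ordinary tags carry their own names.
print(soup.name)
print(soup.find("p").name)
```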