I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
questSoup = BeautifulSoup(quest)
for floor in questSoup.find_all('location_id'):
print(floor)
What this is supposed to do is to get a part of a huge xml called "quests", based on tag - and its attribute - "id". Then it is supposed to make a new soup from that part and get all the tags from within the . For now, before I figure out which quest ids I want to choose (and how will I handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
# questSoup = BeautifulSoup(quest)
for floor in quest.find_all('location_id'):
print(floor)
No need to build a new soup object from tag object, you can use find_all on both of them, as both are navigable strings, so they behave in the same way and can be accessed in the same way.
In my opinion, soup object is special tag object which is named document
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
Related
I finally got around to displaying the page that I need in text/HTML and did conclude that the data I need is also included. For now I just have it printing the entire page because I remain conflicted between the two elements that I potentially need to get what I want. Between these three highlighted elements 1, 2, and 3, I am having trouble with first identifying which one I should reference (I would go with the 'table' element but it doesn't highlight the left most column with ticker names which is literally half the point of getting this data, though the name is referenced like so as shown in the highlighted yellow part). Also, the class descriptions seem really long and and sometimes appears to have two within the same elements so I was wondering how I would address that? And though this problem is not as immediate, if you did take that code and just printed it and scrolled a bit down, the table data is in straight columns so I was wondering if that would be addressed after I reference the proper element or have to write something additional to fix it? Would the fact that I have multiple pages to scan also change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
edit
read_html without bs4
You wont need beautifulsoup to get your goal, pandas is selecting all html tables from the page source and push them into a list of data frames.
In your case there is only one table in the page source, so you get your df by selecting the first element in list by slicing with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
Your close to a solution, let pandas take control and read the html prettified and bs4 flavored to pandas and modify it there to your needs:
pd.read_html(soupt_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df
I am trying to scrape by Beautifulsoup and new to this, I need table rows as you see enter image description here.
The tables are coming from reactapp and then shown on the website. I need suggestion how to do this. I am struggling to create the beautifulsoup object and do not know what the actual class to grap to reach table rows and their content.
webpage = urlopen(req).read()
soup = bs(webpage, "html.parser")
table=soup.find('table', {'class': 'equity'})
rows=list()
for row in table.findAll("tr"):
rows.append(row)
Need your help, really appreciated, having hard time to get it done!
You can grab the td elements with this code:
webpage = urlopen(req).read()
soup = bs(webpage, "lxml")
table=soup.find('table', {'class': 'table'}).find('tr')
rows=list()
for row in table.findAll("td"):
rows.append(row)
I prefered using lxml as the parser because it has some advantages, but you can keep using html.parser
You can also use pandas, It will create, It's so much easier to learn from its documentaion (there is a lot).
How can I get the text "Lionel Messi" from this HTML code?
Lionel Messi
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code would I have to put in to get only the text of it?
I want to scrape all player names form that page in my code. But first I need to find a way to get that text extracted I think.
Cant find a way to make it work unfortunately.
I am new to python and try to do some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it is changing. The part that is always the same is "form rating ut20" so I thought maybe there is some kind of a placeholder that let me search for all "class" names inlcuding "form rating ut20"
Could you maybe help me with this as well?
To select specific class you can use either regular expression or if you have installed version bs4 4.7.1 or above you can use css selector.
Using regular expression will get list of element.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Or Using css selector will get list of element.1st css selector means contains and other one means starts-with.
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
Get text using the getText() method.
player_names[0].getText()
I know versions of this question were asked in the past, but I'm still confused and would like to settle my doubts once and for all, if possible.
If I use
from bs4 import BeautifulSoup
my soup assignment is going to be
soup = BeautifulSoup(html, "lxml")
If I do the importing thus:
from bs4 import BeautifulSoup as bs4
my soup assignment is
soup = bs4(html, "lxml")
Finally, if I import using:
import bs4
my soup assignment is
soup = bs4.BeautifulSoup(html, "lxml")
Let's use a simple html and code:
html = """
Some Document
"""
link = soup.select('a:contains(Document)')
Next, the main question:
type(link[0])
The output - in all three import cases - is:
bs4.element.Tag
But if I ask:
isinstance(link[0],bs4.element.Tag)
In the third case, I get True, but in the first two cases, I get
AttributeError: type object 'BeautifulSoup' has no attribute 'element'
Since the select() and find_all() methods frequently deliver bothTag or NavigableString results, I need to determine which is which using, for example, isinstance(). So in those cases, do I have to use the third import method? Why is there a difference in the first place?
This is a naming game you are doing. Lets go ahead and state that class bs4.element.Tag is the class of element instances. Think of that as the absolute location of the Tag class in bs4. bs4.element represents the nested modules with Tag (which is found under the element module) being the class in which the elements are instances of. When displaying the class info of those elements, it will always show bs4.element.Tag.
Now, with all of that said, you can access the BeautifulSoup object in different ways. And none of this changes the fact that element tags are of type bs4.element.Tag. When you import bs4:
import bs4
bs4.BeautifulSoup()
This imports the module under the module's default name bs4. And then you can access BeautifulSoup in that module with the dot notation as BeautifulSoup is a member of that module. But locally bs4 is just a variable that references the bs4 module.
When you import as:
from bs4 import BeautifulSoup as bs4
bs4 does not mean the same thing as the first example. In the first example we imported the entire module under its default name (bs4), but here we instead import the BeautifulSoup class and rename it locally as bs4. Regardless of what we call it locally, it is still a class at bs4.BeautifulSoup, where bs4 is the module name. Locally though (local to this file), we created a variable reference to the BeautifulSoup class with a name that happens to be the same as the module.
So, when you use select to return elements, they are of the type bs4.element.Tag. This is true regardless of what your local variables happen to be named. This is internally how they are known.
So, when comparing instance, it is important to know, the variable name is not important, what is important is what the variable is referencing. In the third example, import bs4 causes bs4 to reference the bs4 module; therefore, Tag can be accessed at bs4.element.Tag. But in the case where you use from bs4 import BeautifulSoup as bs4, bs4 no longer references the bs4 module, it references the BeautifulSoup class which has no attributes called element with the attribute Tag as it is not a module but a class.
The local name is just how your current file is referencing the object it refers to.
So in your failing cases, you would need to import the Tag reference to a variable you can provide to instance:
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> soup = bs4.BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>')
>>> isinstance(soup.find('div'), Tag)
True
Tag here is just a name, but it references bs4.element.Tag, so it works.
We could call it anything and it will still work as long as it references the correct object:
>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag as Apple
>>> soup = bs4.BeautifulSoup('<div>Test<span>test</span><span>test2</span></div>')
>>> isinstance(soup.find('div'), Apple)
True
Hopefully that makes more sense :).
EDIT: Just a tip, but bs4 makes some references to things like NavigableString and Tag available in the top level module, so you don't have to reach all the way down to bs4.element to get a proper reference, you can simply do:
from bs4 import Tag, NavigableString
Again, this alternative reference of bs4.Tag is just a variable named Tag in the bs4 module that refers to the actual bs4.element.Tag class. You can use that, and it will still refer to the same class. It is just used locally in the bs4 module to reference the Tag class in element.
So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or requests HTTP library) it downloads the original HTML source file.
Initially there's only sinlge tag with the class name 'ships-listing' because that tag comes with the source page. But once you scroll down, the page generates additional <ul class='ships-listing'> and these elements are generated by the JavaScript.
So when you download a page using urllib, the downloaded content only contains the original source page (you could see it by view-source option in the browser).