Beautifulsoup Scraping Object Selection - beautifulsoup

I am trying to scrape by Beautifulsoup and new to this, I need table rows as you see enter image description here.
The tables are coming from reactapp and then shown on the website. I need suggestion how to do this. I am struggling to create the beautifulsoup object and do not know what the actual class to grap to reach table rows and their content.
webpage = urlopen(req).read()
soup = bs(webpage, "html.parser")
table=soup.find('table', {'class': 'equity'})
rows=list()
for row in table.findAll("tr"):
rows.append(row)
Need your help, really appreciated, having hard time to get it done!

You can grab the td elements with this code:
webpage = urlopen(req).read()
soup = bs(webpage, "lxml")
table=soup.find('table', {'class': 'table'}).find('tr')
rows=list()
for row in table.findAll("td"):
rows.append(row)
I prefered using lxml as the parser because it has some advantages, but you can keep using html.parser
You can also use pandas, It will create, It's so much easier to learn from its documentaion (there is a lot).

Related

I am not sure between which two elements I should be looking to scrape and formatting error (jupyter + selenium)

I finally got around to displaying the page that I need in text/HTML and did conclude that the data I need is also included. For now I just have it printing the entire page because I remain conflicted between the two elements that I potentially need to get what I want. Between these three highlighted elements 1, 2, and 3, I am having trouble with first identifying which one I should reference (I would go with the 'table' element but it doesn't highlight the left most column with ticker names which is literally half the point of getting this data, though the name is referenced like so as shown in the highlighted yellow part). Also, the class descriptions seem really long and and sometimes appears to have two within the same elements so I was wondering how I would address that? And though this problem is not as immediate, if you did take that code and just printed it and scrolled a bit down, the table data is in straight columns so I was wondering if that would be addressed after I reference the proper element or have to write something additional to fix it? Would the fact that I have multiple pages to scan also change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
edit
read_html without bs4
You wont need beautifulsoup to get your goal, pandas is selecting all html tables from the page source and push them into a list of data frames.
In your case there is only one table in the page source, so you get your df by selecting the first element in list by slicing with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
Your close to a solution, let pandas take control and read the html prettified and bs4 flavored to pandas and modify it there to your needs:
pd.read_html(soupt_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df

Python BeautifulSoup get text from class

How can I get the text "Lionel Messi" from this HTML code?
Lionel Messi
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code would I have to put in to get only the text of it?
I want to scrape all player names form that page in my code. But first I need to find a way to get that text extracted I think.
Cant find a way to make it work unfortunately.
I am new to python and try to do some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it is changing. The part that is always the same is "form rating ut20" so I thought maybe there is some kind of a placeholder that let me search for all "class" names inlcuding "form rating ut20"
Could you maybe help me with this as well?
To select specific class you can use either regular expression or if you have installed version bs4 4.7.1 or above you can use css selector.
Using regular expression will get list of element.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Or Using css selector will get list of element.1st css selector means contains and other one means starts-with.
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
Get text using the getText() method.
player_names[0].getText()

Unable to pare the href tag in python

I get the following output in my beautiful soup.
[Search over 301,944 datasets\n]
I need to extract only the number 301,944 in this. Please guide me how this can be done. My code so far
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.data.gov/').text
soup = BeautifulSoup (source , 'lxml')
#print soup.prettify()
images = soup.find_all('small')
print images
con = images.find_all('a') // I am unable to get anchor tag here. It says anchor tag not present
print con
#for con in images.find_all('a',href=True):
#print con
#content = images.split('metrics')
#print content[1]
#images = soup.find_all('a', {'href':re.compile('\d+')})
#print images
There is only one <small> tag on website.
Your images variable references it. But you use it in a wrong way to retrive anchor tag.
If you want to retrieve text from a tag you can get it with:
soup.find('small').a.text
where find method returns first small element it encounters on website. If you use find_all, you will get list of all small elements (but there's only one small tag here).

Scrape table data using pandas/beautiful soup (instead of selenium which is slow?), BS implementation not working

I am trying to scrape web data on this website, and the only way I was able to access the data was by iterating through the rows of the table, adding them to a list (then adding them to a pandas data frame/writing to a csv), and then clicking to the next page and repeating the process [there are about 50 pages per search and my program does 100+ searches]. It's super slow/inefficient, and I was wondering if there was a way to efficiently add all the data using pandas or beautiful soup instead of iterating through each line/column.
url = "https://claimittexas.org/app/claim-search"
rows = driver.find_elements_by_xpath("//tbody/tr")
try:
for row in rows[1:]:
row_array = []
#print(row.text) # prints the whole row
for col in row.find_elements_by_xpath('td')[1:]:
row_array.append(col.text.strip())
table_array.append(row_array)
df = pd.DataFrame(table_array)
df.to_csv('my_csv.csv', mode='a', header=False)
except:
print(letters + "no table exists")
EDIT: I tried to scrape using beautiful soup, something I tried earlier in the week and posted about, but I can't seem to access the table without using selenium
With the bs version, I put a bunch of print statements in to see what was wrong, and it has that the rows value is just an empty list
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
for row in rows[1:]:
cells = row.find_all('td')
for cell in cells[1:]:
print(cell.get_text())
use this line in BS4 code implimentation
rows = soup.find('table').find('tbody').find_all('tr')[1:]
instead of
rows = soup.find('table').find('tbody').find_all(('tr')[1:])

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
questSoup = BeautifulSoup(quest)
for floor in questSoup.find_all('location_id'):
print(floor)
What this is supposed to do is to get a part of a huge xml called "quests", based on tag - and its attribute - "id". Then it is supposed to make a new soup from that part and get all the tags from within the . For now, before I figure out which quest ids I want to choose (and how will I handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
# questSoup = BeautifulSoup(quest)
for floor in quest.find_all('location_id'):
print(floor)
No need to build a new soup object from tag object, you can use find_all on both of them, as both are navigable strings, so they behave in the same way and can be accessed in the same way.
In my opinion, soup object is special tag object which is named document
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'