Extract text from HTML using BeautifulSoup

I'm new to Python and BeautifulSoup and need help writing a for loop to retrieve some text values from HTML. Also new to Stack Overflow :-)
I am able to crawl the webpage using the td tag below and find rows which have employees of a company that I want to add to a list. I'm not sure how to write the for loop that will disregard the tags and just retrieve the text value (i.e. the employee names) from each row, and then add that to a new list, employees. So in the example below, how do I retrieve John Doe, Bob Smith etc. and add them to a list? Any help appreciated.
import requests
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
url = 'my target URL'
target_url= uReq(url)
target_html = target_url.read()
soupy = soup(target_html, 'html.parser')
print(soupy.prettify())
employees = soupy.findAll('td', headers='table5593r1')
employees
<td headers="table5593r1">Mr John Doe</td>,
<td headers="table5593r1">Dr Bob Smith</td>,
<td headers="table5593r1">Dr Jane Do</td>,
<td headers="table5593r1">Ms Mary Jane</td>,

This post shows how you can get the text of HTML elements/tags. In order to add the employee names to a new list, you could do the following:
employees = soupy.findAll('td', headers='table5593r1')
employeeNames = []
for employee in employees:
    employeeName = employee.text
    employeeNames.append(employeeName.strip())
I'd also recommend taking a further look at this post on looping over a list of HTML elements.
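If you prefer a one-liner, the same loop can be written as a list comprehension; get_text(strip=True) is BeautifulSoup's built-in way to strip surrounding whitespace during extraction:
# equivalent one-liner over the employees list from above
employeeNames = [td.get_text(strip=True) for td in employees]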

Related

How to find href links that start with a certain keyword using beautiful soup?

The task I am doing right now is very monotonous. In this task I have to go to this website (e.g. this page). You can see that there is a hyperlink attached to each case in the Status column. I am trying to find a way to grab the hrefs that start with the keyword case-details, since they are the links from the Status column for each particular case and the hyperlinks contain the details of each case.
My code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Which gives the following output (line numbers added for clarity):
....
44 /order-judge-wise
45 order-judgement-date-wise
46 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==
47 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==
48 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==
49 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==
50 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==
51 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==
52 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==
53 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==
54 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==
55 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==
56 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=39
57 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=1
....
I want to grab the href links that start with "case-details" and put them into a list, which I later use to scrape details of each case and put them into an Excel file.
So far I've tried to make a loop that looks for these links:
for link in soup.find_all('a'):
    if "case" in link.get_text():
        print(link['href'])
But so far, no success. I also want to know how to turn the result into a list.
expected output:
url_list1 = ["case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTAzMjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA1MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA2MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA3MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAxNw==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAxOQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAyMQ=="]
To select only those <a> elements whose href starts with case-details, you could use CSS selectors:
soup.select('a[href^="case-details"]')
Be aware you have to prepend a base URL, e.g. with a list comprehension:
['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]
Example
import requests
from bs4 import BeautifulSoup
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
urls = ['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]
Output
['https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==']
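Since the question also mentions writing the links to an Excel file, here is a slightly more defensive sketch; urljoin from the standard library resolves each href against the page URL, the output name cases.xlsx is just an assumed placeholder, and writing Excel requires an engine such as openpyxl to be installed:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# urljoin also copes with absolute or differently rooted hrefs
urls = [urljoin(url, a['href']) for a in soup.select('a[href^="case-details"]')]

# 'cases.xlsx' is an assumed filename; needs openpyxl installed
pd.DataFrame({'case_url': urls}).to_excel('cases.xlsx', index=False)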

Scrape wikipedia table using BeautifulSoup

I would like to scrape the table titled "List of chemical elements" from the Wikipedia link below and display it using pandas.
https://en.wikipedia.org/wiki/List_of_chemical_elements
I am new to BeautifulSoup and this is currently what I have.
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')
table_soup = soup.find_all('table')
You can select the table with BeautifulSoup in different ways:
By its "title":
soup.select_one('table:-soup-contains("List of chemical elements")')
By order in tree (it is the first one):
soup.select_one('table')
soup.select('table')[0]
By its class (there is no id in your case):
soup.select_one('table.wikitable')
Or simply with pandas
pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
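If you want to combine the two approaches, a minimal sketch (building on the soup and pd objects already defined in the question) selects the table first and hands just that fragment to pandas; newer pandas versions expect literal HTML to be wrapped in StringIO:
from io import StringIO

table = soup.select_one('table.wikitable')
# parse only the selected table fragment into a DataFrame
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())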
To get the expected result, try it yourself, and if you run into difficulties, ask a new question.

How to store values together after scraping

I am able to scrape individual fields off a website, but would like to map the title to the time.
The fields have their own classes, so I am struggling with how to map the time to the title.
A dictionary would work, but how would I structure/format this dictionary so that it stores values on a line-by-line basis?
url for reference - https://ash.confex.com/ash/2021/webprogram/STUDIO.html
expected output:
9:00 AM-9:30 AM, Defining Race, Ethnicity, and Genetic Ancestry
11:00 AM-11:30 AM, Definitions of Structural Racism
etc
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://ash.confex.com/ash/2021/webprogram/STUDIO.html')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='itemtitle')
for item in productlist:
    for eachLine in item.find_all('a', href=True):
        title = eachLine.text
        print(title)
times = driver.find_elements_by_class_name("time")
for t in times:
    print(t.text)
Selenium is overkill here. The website doesn't use any dynamic content, so you can scrape it with Python requests and BeautifulSoup. Here is code showing how to achieve it. You need to query productlist and times separately and then iterate by index to get both items at once. I pass the length of productlist to range() because I assume productlist and times have equal length.
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/STUDIO.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
for iterator in range(len(productlist)):
    row = times[iterator].text + ", " + productlist[iterator].text
    print(row)
Note: soup.select() gathers items by CSS selector.
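As a small design note, zip() pairs the two lists without index bookkeeping and simply stops at the shorter list if the lengths ever disagree; an equivalent sketch of the loop above:
# pair each time element with its title element
for t, product in zip(times, productlist):
    print(t.text + ", " + product.text)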

Python BeautifulSoup get text from class

How can I get the text "Lionel Messi" from this HTML code?
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names[0] I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code would I have to use to get only the text of it?
I want to scrape all player names from that page. But first I need to find a way to get that text extracted, I think.
I can't find a way to make it work, unfortunately.
I am new to Python and am trying out some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it is changing. The part that is always the same is "form rating ut20", so I thought maybe there is some kind of placeholder that lets me search for all class names including "form rating ut20".
Could you maybe help me with this as well?
To select a specific class you can use either a regular expression or, if you have bs4 version 4.7.1 or above installed, a CSS selector.
Using a regular expression will get a list of elements.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Using a CSS selector will also get a list of elements. The first selector below means "contains" and the other means "starts with".
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
Get text using the getText() method.
player_names[0].getText()
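Building on the player_names list from the code above, a short sketch to collect every name on the page into a plain list of strings:
# getText() on each matched element gives just the visible text
names = [a.getText() for a in player_names]
print(names)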

Scraping categories and subcategories using beautifulsoup

I am trying to retrieve all categories and subcategories within a website. I am able to use BeautifulSoup to pull every single product in the category once I am in it. However, I am struggling with the loop for categories. I'm using this as a test website: http://www.shophive.com.
How do I loop through each category as well as the subcategories on the left side of the website? I would like to extract all products within the category/subcategory and display on my page.
from bs4 import BeautifulSoup
import user_agent
import requests
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    main_category_links.append(div.find('a').get('href'))
print(main_category_links)
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
print(subcategory_links)
I'll break this down for you piece by piece.
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
Here we just make the request and store the HTML. I use a module called user_agent to generate a User-Agent for the headers; that's just my preference.
<div class="parentMenu arrow">
<a href="http://www.shophive.com/year-end-clearance-sale">
<span>New Year's Clearance Sale</span>
</a>
</div>
The links for the main categories are stored like this, so in order to extract just the links we do the following:
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    main_category_links.append(div.find('a').get('href'))
We iterate over the results of soup.find_all('div', class_='parentMenu arrow') since the links we want are children of these elements. Then we append div.find('a').get('href') to our list of main category links. We use div.find because we only want the first link inside each div, then we get the contents of its href.
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
<span>Mac / Macbooks</span>
</a>
The subcategories are stored like this; notice the "a" tag has a class this time, which makes it a little easier for us to find.
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
Here we iterate over soup.find_all('a', class_='itemMenuName'). When you search by class in BeautifulSoup, class is treated as a multi-valued attribute, so you can match on a single class even when the element carries several. This is helpful here since the class name varies from itemMenuName level1 to itemMenuName level2. These elements already contain the link, so we just extract the contents of the href that holds the URL with link.get('href') and append it to our list of subcategory links.
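From there, visiting each subcategory page and pulling the products could look like the sketch below; note that the '.product-name a' selector is a hypothetical placeholder, since the site's actual product markup isn't shown here, so you would need to inspect the page and adapt it:
for url in subcategory_links:
    sub_req = requests.get(url, headers=headers)
    sub_soup = BeautifulSoup(sub_req.text, 'html.parser')
    # '.product-name a' is an assumed selector; replace it with
    # whatever element wraps each product title on the real page
    for product in sub_soup.select('.product-name a'):
        print(product.get_text(strip=True))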