How to find href links that start with a certain keyword using Beautiful Soup? - beautifulsoup

The task I am doing right now is very monotonous: I have to go to this website (e.g. this page), where a hyperlink is attached to each case in the Status column. I am trying to find a way to grab the href values that start with the keyword case-details, since those are the links from the Status column for each particular case and they contain the details regarding the cases.
My code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
This gives the following output (line numbers added for clarity):
....
44 /order-judge-wise
45 order-judgement-date-wise
46 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==
47 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==
48 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==
49 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==
50 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==
51 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==
52 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==
53 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==
54 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==
55 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==
56 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=39
57 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=1
....
I want to grab the href links that start with "case-details" and put them into a list, which I will later use to scrape the details of each case and write them to an Excel file.
So far I've tried a loop that looks for these links:
for link in soup.find_all('a'):
    if "case" in link.get_text():
        print(link['href'])
But so far, no success. I also want to know how to collect the results into a list.
expected output:
url_list1 = ["case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTAzMjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA1MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA2MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA3MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAxNw==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAxOQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAyMQ=="]

To select only those <a> whose href starts with case-details, you can use a CSS selector (note that your attempt checked the link text, not the href):
soup.select('a[href^="case-details"]')
Be aware that you have to prepend a base URL, e.g. with a list comprehension:
['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]
Example
import requests
from bs4 import BeautifulSoup
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
urls = ['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]
Output
['https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==',
'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==']
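A small variant, if you prefer not to hand-write the base prefix: urllib.parse.urljoin (standard library) resolves each relative href against the page URL. A minimal sketch:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# urljoin resolves each relative href against the page URL,
# so the base does not have to be hard-coded
urls = [urljoin(url, a['href']) for a in soup.select('a[href^="case-details"]')]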

Related

WebScrape - Pagination/Next page

The current code works and scrapes the page how I want it to.
However, how can I get this to run for the next page? The URL is not unique for the second page, and I want to run it for all pages.
import requests
from bs4 import BeautifulSoup as bs
lists=[]
r = requests.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
soup = bs(r.content, 'lxml')
d = {i.text.strip():i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
lists.append(d)
It's a dynamically loaded webpage, so you need to use Selenium for that (a sketch follows below). Have a look here: Selenium with Python
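A minimal sketch of that approach, reusing your selector. The 'a.next-page' locator is an assumption; replace it with the site's actual next-page control:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

lists = []
driver = webdriver.Chrome()
driver.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
while True:
    soup = bs(driver.page_source, 'lxml')
    d = {i.text.strip(): i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
    lists.append(d)
    # 'a.next-page' is a placeholder -- inspect the page for the real control
    next_links = driver.find_elements(By.CSS_SELECTOR, 'a.next-page')
    if not next_links:
        break
    next_links[0].click()
driver.quit()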

How store values together after scrape

I am able to scrape individual fields off a website, but I would like to map each title to its time.
The fields each have their own class, so I am struggling with how to map the time to the title.
A dictionary would work, but how would I structure/format this dictionary so that it stores the values on a line-by-line basis?
url for reference - https://ash.confex.com/ash/2021/webprogram/STUDIO.html
expected output:
9:00 AM-9:30 AM, Defining Race, Ethnicity, and Genetic Ancestry
11:00 AM-11:30 AM, Definitions of Structural Racism
etc
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
import time
driver.get('https://ash.confex.com/ash/2021/webprogram/STUDIO.html')
time.sleep(3)
page_source = driver.page_source
soup=BeautifulSoup(page_source,'html.parser')
productlist=soup.find_all('div',class_='itemtitle')
for item in productlist:
    for eachLine in item.find_all('a', href=True):
        title = eachLine.text
        print(title)
times = driver.find_elements_by_class_name("time")
for t in times:
    print(t.text)
Selenium is overkill here. The website doesn't use any dynamic content, so you can scrape it with Python requests and BeautifulSoup. Here is code showing how to achieve it. You need to query productlist and times separately and then iterate by index to get both items at once. I pass the length of productlist to range() because I assume that productlist and times have equal length.
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/STUDIO.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
for iterator in range(len(productlist)):
    row = times[iterator].text + ", " + productlist[iterator].text
    print(row)
Note: soup.select() gathers items by CSS selector.
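If you prefer not to manage indexes, zip() pairs the two lists directly and stops at the shorter one, which also guards against unequal lengths. A short variant of the same scrape:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://ash.confex.com/ash/2021/webprogram/STUDIO.html')
soup = BeautifulSoup(res.content, 'html.parser')
# zip() pairs each time with its title
for t, title in zip(soup.select('.time'), soup.select('div.itemtitle > a')):
    print(t.text + ", " + title.text)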

I am not sure between which two elements I should be looking to scrape and formatting error (jupyter + selenium)

I finally got around to displaying the page that I need in text/HTML and concluded that the data I need is included. For now I just have it printing the entire page, because I remain conflicted about which element I actually need.
Between the three highlighted elements 1, 2, and 3, I am having trouble first identifying which one I should reference. I would go with the 'table' element, but it doesn't highlight the leftmost column with the ticker names, which is literally half the point of getting this data (though the name is referenced as shown in the highlighted yellow part). Also, the class descriptions seem really long, and sometimes there appear to be two within the same element, so I was wondering how I would address that?
And though this problem is not as immediate: if you take that code, print it, and scroll a bit down, the table data is in straight columns. Would that be addressed once I reference the proper element, or do I have to write something additional to fix it? Would the fact that I have multiple pages to scan also change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
Edit
read_html without bs4
You won't need BeautifulSoup to reach your goal; pandas selects all HTML tables from the page source and pushes them into a list of data frames.
In your case there is only one table in the page source, so you get your df by selecting the first element of the list with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
You're close to a solution: let pandas take control, read the HTML prettified and bs4-flavored into pandas, and modify it there to your needs:
pd.read_html(soup.select_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df
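If the table is built by JavaScript and has not rendered yet when page_source is captured, read_html can come back empty. A hedged sketch with an explicit wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
# wait up to 15 seconds for a <table> to appear in the DOM
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table')))
df = pd.read_html(driver.page_source)[0]
driver.close()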

Web scraping using python throws empty array

import requests
from bs4 import BeautifulSoup as soup
my_url='http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty'
page=requests.get(my_url)
data=page.text
page_soup=soup(data,'html.parser')
cont=page_soup.select("div",{"class": "item-page"})
print(cont)
I am trying to scrape the faculty details (name, designation, profile) into a CSV file.
When I use the above code, it returns an empty [].
Any help is greatly appreciated.
The page is looking for any of a defined set of valid user agents. For example,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty', headers = {'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
print(soup.select('.item-page'))
Without that, you get a 406 response and the classes you are looking for are not present in the html.
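From there, writing the details to a CSV could look like the sketch below. I am assuming the details sit in <p> tags inside .item-page, so adapt the selector to the real markup:
import csv
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty',
                 headers={'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
# write each non-empty paragraph of the content block as one CSV row;
# the '.item-page p' selector is an assumption about the markup
with open('faculty.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for p in soup.select('.item-page p'):
        text = p.get_text(strip=True)
        if text:
            writer.writerow([text])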

Using BS, I cannot "find" the ID of info, when I know it exists

I am a new user of Beautiful Soup and am trying to create a baby application that retrieves the view count from a YouTube url.
I looked at the BS docs and saw that you can retrieve items by their id, so I attempted to retrieve the info id. But whenever I attempt to do this, it comes out as "None", so it must not be finding the id.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divisions = soup.findAll("div")
print(divisions[0])
info = soup.find(id="info")
print(info)
You can try searching for the meta itemprop="interactionCount" tag, but this value can often be inexact. The best way is to use the official YouTube API:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])
Prints:
165011
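For completeness, a minimal sketch of the official YouTube Data API v3 route mentioned above; YOUR_API_KEY is a placeholder for a key created in the Google Cloud console:
import requests

params = {
    'part': 'statistics',   # the statistics part includes viewCount
    'id': 'bUPvE5yv72I',
    'key': 'YOUR_API_KEY',  # placeholder -- supply your own API key
}
r = requests.get('https://www.googleapis.com/youtube/v3/videos', params=params)
print(r.json()['items'][0]['statistics']['viewCount'])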