I am trying to scrape a site for a link to the newest factsheet. I've tried using Selenium and BeautifulSoup, but each time I am unable to find the link with either tool. For instance, when checking the output with BeautifulSoup, I get nothing for that part of the page. Any suggestions?
Link to the scraped site: https://www.biotechgt.com/performance/monthly-factsheets
Using Selenium:
#BIOG
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('https://www.biotechgt.com/performance/monthly-factsheets')
html = driver.page_source
driver.find_elements(By.XPATH, '/html/body/div/main/section/div/div/div/div/div[2]/div/div[1]/div[2]/div/table/tbody[1]/tr[2]/td/a')
To get all download links from the page, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.biotechgt.com/performance/monthly-factsheets"
soup = BeautifulSoup(
    requests.get(url, cookies={"dp-disclaimer": "APPROVED"}).content,
    "html.parser",
)
for a in soup.select("a.gtm-downloads:has(.btn-download)"):
    print(a["href"])
Prints:
https://www.biotechgt.com/download_file/force/191/209
https://www.biotechgt.com/download_file/force/187/209
https://www.biotechgt.com/download_file/force/185/209
https://www.biotechgt.com/download_file/force/184/209
...
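Assuming the first printed link is the newest factsheet (the list appears to be ordered newest first), here is a minimal sketch of downloading it with requests, reusing the same disclaimer cookie; the local filename is arbitrary:
import requests
from bs4 import BeautifulSoup

url = "https://www.biotechgt.com/performance/monthly-factsheets"
cookies = {"dp-disclaimer": "APPROVED"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
newest = soup.select_one("a.gtm-downloads:has(.btn-download)")["href"]

# Stream the PDF to disk, keeping the disclaimer cookie on the request
with requests.get(newest, cookies=cookies, stream=True) as r:
    r.raise_for_status()
    with open("latest-factsheet.pdf", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)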
You have the page source:
html = driver.page_source
but you are not passing it to BeautifulSoup at all, so change that:
soup = BeautifulSoup(driver.page_source, "lxml")
As far as Selenium is concerned, you can use the CSS selector below:
a[href^='https://www.biotechgt.com/download']
In code:
ele = driver.find_element(By.CSS_SELECTOR, "a[href^='https://www.biotechgt.com/download']")
Then you can call ele.click() or do anything else with the web element.
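For example, to read the link target instead of clicking it:
print(ele.get_attribute("href"))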
Update 1:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.biotechgt.com/performance/monthly-factsheets")
wait = WebDriverWait(driver, 10)
# Dismiss the cookie banner and the disclaimer before touching the table
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[text()=' Allow all cookies ']"))).click()
driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Accept']"))).click()
ActionChains(driver).move_to_element(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a[href^='https://www.biotechgt.com/download']")))).perform()
for link in driver.find_elements(By.CSS_SELECTOR, "a[href^='https://www.biotechgt.com/download']"):
    print(link.get_attribute('href'))
Related
I am trying to scrape all company names from the Inc. 5000 site ("https://www.inc.com/inc5000/2021"). The problem is that the company names are rendered with JavaScript. I have tried using Selenium and requests_html to render the site, but when I fetch the page source I still get the raw JavaScript. This is what I tried. I am new to web scraping, so it is possible that I am making some foolish mistake; please guide me.
Here is my code.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get("https://www.inc.com/inc5000/2021")
data=driver.page_source
print(data)
You could give it some time to render, or use Selenium's waits (see the sketch after this snippet):
import time
driver.get('https://www.inc.com/inc5000/2021')
time.sleep(5)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
for e in soup.select('.company'):
    print(e.text)
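And the waits variant, a minimal sketch that blocks until the company elements have rendered instead of sleeping a fixed five seconds:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.inc.com/inc5000/2021')
# Wait until at least one .company element is present in the DOM (up to 15 s)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'company'))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for e in soup.select('.company'):
    print(e.text)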
Why do you need BeautifulSoup? You could just use Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.inc.com/inc5000/2021")
companies = [e.text for e in driver.find_elements(By.CLASS_NAME, "company")]
This will only give you the elements currently in the viewport, though; you need to improve on that by scrolling, as sketched below.
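A minimal scrolling sketch that continues the snippet above (driver and By are already in scope); the list is assumed to keep growing as you scroll, and the pause length and end condition are guesses you may need to tune:
import time

seen = set()
last_height = 0
while True:
    # Collect whatever is currently rendered before scrolling further
    seen.update(e.text for e in driver.find_elements(By.CLASS_NAME, "company"))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new loaded; assume we hit the end
        break
    last_height = new_height
print(len(seen))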
I am trying to parse the date element ("3 February 2022") on the following webpage. However, I am unable to find it, even when using Selenium to load the page. Any suggestions as to what I am doing wrong? I am currently trying the following code:
from bs4 import BeautifulSoup
import time
import re
from selenium import webdriver
url = "http://www.londonstockexchange.com/news-article/SAIN/net-asset-value-s/15316710"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
soup = str(BeautifulSoup(driver.page_source, 'html.parser'))
date = re.findall(r"[0-9]{1,2}\s[A-Z][a-z]+\s[0-9]{4}", soup)
print(f'Takes {date[-1]} out of the possible dates: {date}')
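If the fixed five-second sleep is the issue (the page can take longer to render), one option is to poll until the date pattern actually appears. A minimal sketch, assuming the rendered page eventually contains a date in that format:
import re
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

DATE_RE = re.compile(r"[0-9]{1,2}\s[A-Z][a-z]+\s[0-9]{4}")

driver = webdriver.Chrome()
driver.get("http://www.londonstockexchange.com/news-article/SAIN/net-asset-value-s/15316710")
# Poll the rendered page source until a date shows up (up to 30 s)
WebDriverWait(driver, 30).until(lambda d: DATE_RE.search(d.page_source))
print(DATE_RE.findall(driver.page_source)[-1])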
I am trying to scrape this website using Python's BeautifulSoup package, and I am using Selenium to automate the user flow. As this website requires authentication to access the page, I am trying to log in first using the Selenium webdriver. Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
def configure_driver():
    # Add additional Options to the webdriver
    chrome_options = Options()
    # add the argument and make the browser Headless.
    chrome_options.add_argument("--headless")
    # Instantiate the Webdriver: mention the executable path of the webdriver you have downloaded
    # For linux/Mac
    # driver = webdriver.Chrome(options=chrome_options)
    # For windows
    driver = webdriver.Chrome(executable_path="/home/<user_name>/Downloads/chromedriver_linux64/chromedriver",
                              options=chrome_options)
    return driver

def getLinks(driver):
    # Step 1: Go to the Coursera supplement page to scrape
    driver.get("https://www.coursera.org/learn/competitive-data-science/supplement/RrrDR/additional-material-and-links")
    # wait for the element to load
    try:
        WebDriverWait(driver, 5).until(lambda s: s.find_element_by_class_name("_ojjigd").is_displayed())
    except TimeoutException:
        print("TimeoutException: Element not found")
        return None
    email = driver.find_element_by_name('email')
    print(str(email))
    password = driver.find_element_by_name('password')
    email.send_keys("username")  # provide some actual username
    password.send_keys("password")  # provide some actual password
    form = driver.find_element_by_name('login')
    print(form.submit())
    WebDriverWait(driver, 10)
    print(driver.title)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Step 3: Iterate over the search results and fetch the courses
    divs = soup.findAll('div', attrs={'class': 'item-box-content'})
    print(len(divs))

# create the driver object.
driver = configure_driver()
getLinks(driver)
# close the driver.
driver.close()
Now, after doing form.submit(), it is expected to log in and change the page, right? But it simply stays on the same page, so I cannot access the contents of the authenticated page. Someone please help.
That is because there is no name attribute on the login button.
Instead of this:
form = driver.find_element_by_name('login')
use this:
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Login']"))).click()
I tried this code locally and it seems to work fine:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 30)
driver.get("https://www.coursera.org/learn/competitive-data-science/supplement/RrrDR/additional-material-and-links")
wait.until(EC.element_to_be_clickable((By.ID, "email"))).send_keys("some user name")
wait.until(EC.element_to_be_clickable((By.ID, "password"))).send_keys("some password")
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Login']"))).click()
Since the login button is inside a form, .submit() should work too:
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Login']"))).submit()
This works too.
Similar to "beautiful soup find returns none from rightmove", but for a different site: https://www.sainsburys.co.uk/gol-ui/SearchDisplayView?filters[keyword]=milk
I tried running:
import os
from selenium import webdriver
from bs4 import BeautifulSoup as soup

url = 'https://www.sainsburys.co.uk/gol-ui/SearchDisplayView?filters[keyword]=banana'
# configure driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"  # IF NOT IN SAME FOLDER CHANGE THIS PATH
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
driver.get(url)
page = driver.page_source
page_soup = soup(page, 'html.parser')
container_tag1 = 'pt__content'
containers = page_soup.findAll("div", {"class": container_tag1})
# print(containers)
print(len(containers))
to no avail.
I tried without Selenium and failed as well.
Any suggestions?
You have to wait for the page to fully render before passing the HTML to BeautifulSoup. One option is to use the sleep function from the built-in time module.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
URL = "https://www.sainsburys.co.uk/gol-ui/SearchDisplayView?filters[keyword]=banana"
driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(URL)
sleep(5) # <-- Wait for the page to fully render
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find_all("div", {"class": "pt__content"}))
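A fixed sleep can be flaky, so an explicit wait is a more robust option. A minimal sketch, assuming the product tiles keep the pt__content class:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.sainsburys.co.uk/gol-ui/SearchDisplayView?filters[keyword]=banana")
# Block until at least one product container is in the DOM (up to 15 s)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "pt__content"))
)
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find_all("div", {"class": "pt__content"}))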
I am trying to scrape a website with the help of BeautifulSoup. I am not able to get the contents of the website, even though they appear in the source code when I inspect the site.
import requests
from bs4 import BeautifulSoup

url1 = 'https://recruiting.ultipro.com/usg1006/JobBoard/dfc53730-57d1-3460-336f-ddafabd108f3/?q=&o=postedDateDesc'
response1 = requests.get(url1)
print(response1.text[:500])
html_soup1 = BeautifulSoup(response1.text, 'html.parser')
type(html_soup1)
all_info1 = html_soup1.find("div", {"data-bind": "foreach: opportunities"})
all_info1
all_automation1 = all_info1.find_all("div",{"data-automation":"opportunity"})
all_automation1
In the source code there are "job-title", "location", "description" and other details, but I am not able to see the same details in the HTML content that requests returns.
You should try something like this, or anything similar, to fetch the titles from that page:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://recruiting.ultipro.com/usg1006/JobBoard/dfc53730-57d1-3460-336f-ddafabd108f3/?q=&o=postedDateDesc')
time.sleep(3)  # let the browser load its content

soup = BeautifulSoup(driver.page_source, 'lxml')
for item in soup.select("h3 .opportunity-link"):
    print(item.text)
driver.quit()
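Alternatively, you can skip BeautifulSoup and read the titles straight from Selenium (run this before driver.quit()); a minimal sketch using the same selector:
from selenium.webdriver.common.by import By

for item in driver.find_elements(By.CSS_SELECTOR, "h3 .opportunity-link"):
    print(item.text)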