I'm trying to crawl a particular part of this page (the company-name headers).
Sometimes Selenium gets that part, but sometimes it doesn't, and I can't figure out why.
My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(r'C:(my path)\chromedriver.exe')
url = 'https://www.rocketpunch.com/companies?page=1'
driver.get(url)
driver.implicitly_wait(5)
html = driver.page_source
print(html)
soup = BeautifulSoup(html, 'lxml')
comments = soup.findAll('h4', {'class': 'header name'})
for comment in comments:
    print(comment)
After the driver.implicitly_wait(5) you are using, just add a short fixed delay such as time.sleep(1).
implicitly_wait only makes find_element calls wait until an element is found. It doesn't make driver.get or driver.page_source wait for all the elements / the entire page to be loaded.
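If you prefer not to rely on a fixed sleep, an explicit wait on the elements you actually parse is usually more reliable. A minimal sketch, assuming the h4 elements with class 'header name' from your soup.findAll call are the right target:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r'C:(my path)\chromedriver.exe')
driver.get('https://www.rocketpunch.com/companies?page=1')

# Wait until at least one company-name header is present before grabbing the HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h4.header.name'))
)

soup = BeautifulSoup(driver.page_source, 'lxml')
for comment in soup.find_all('h4', {'class': 'header name'}):
    print(comment)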
I am trying to scrape data from Google Finance at the following link:
https://www.google.com/finance/quote/ACN:NYSE
The section I am trying to fetch is on the right side and contains information like market cap, P/E ratio, etc.
Earlier I thought it was rendered by JavaScript and wrote the following snippet:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

base_url = 'https://www.google.com/finance/quote/'
suffix = 'ACN:NYSE'
class_name = 'gyFHrc'

options = Options()
options.headless = True
service = Service('/usr/local/bin/geckodriver')
browser = Firefox(service=service, options=options)
browser.get(base_url + suffix)

wait = WebDriverWait(browser, 15)
wait.until(presence_of_element_located((By.CLASS_NAME, class_name)))  # <--line 58
stuff = browser.find_elements(By.CLASS_NAME, class_name)
print(f'stuff-->{stuff}')
for elem in stuff:
    html = elem.get_attribute("outerHTML")
    # print(f'html:{html}')
I get the following error:
File "scraping_google_finance_js.py", line 58, in <module>
wait.until(presence_of_element_located((By.CLASS_NAME, class_name)))
File "/Users/me/opt/anaconda3/envs/scraping/lib/python3.10/site-packages/selenium/webdriver/support/wait.py", line 90, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
WebDriverError#chrome://remote/content/shared/webdriver/Errors.jsm:183:5
NoSuchElementError#chrome://remote/content/shared/webdriver/Errors.jsm:395:5
element.find/</<#chrome://remote/content/marionette/element.js:300:16
Later, I realised that this was plain HTML and that I could use BeautifulSoup as follows:
import requests
from bs4 import BeautifulSoup

class_name = 'gyFHrc'
# html fetched with a plain GET (assumed; the page content is static HTML)
html = requests.get('https://www.google.com/finance/quote/ACN:NYSE').text
soup = BeautifulSoup(html, 'html.parser')
box_rows = soup.find_all("div", class_name)
print(box_rows)
for row in box_rows:
    print(type(row), str(row.contents[1].contents))
This worked, with the following output:
<class 'bs4.element.Tag'> ['$295.14']
<class 'bs4.element.Tag'> ['$289.67 - $298.00']
<class 'bs4.element.Tag'> ['$261.77 - $417.37']
.....
The question is: why did it not work with Selenium? Did I do something wrong, or does Selenium only work with JavaScript sites?
Clearly the time to load the page was not the problem, as BeautifulSoup could fetch and parse the page.
The error selenium.common.exceptions.TimeoutException means the element you are trying to find was not located within the given time.
Your connection may simply be too slow to load the content in time; try increasing the wait timeout.
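For example, just bump the timeout on the existing wait (the 60 seconds here is an arbitrary illustration):
# Same wait as before, only with a longer timeout (60 s is arbitrary)
wait = WebDriverWait(browser, 60)
wait.until(presence_of_element_located((By.CLASS_NAME, class_name)))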
This error usually happens when Selenium can't find the desired tag or element, but in your case the element was there.
I checked the code with a few changes and it worked for me, so it is probably an issue with the element loading in time.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(ChromeDriverManager().install())
class_name = "gyFHrc"
driver.get("https://www.google.com/finance/quote/ACN:NYSE")
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, class_name)))
stuff = driver.find_elements(By.CLASS_NAME, class_name)
print(f"stuff-->{stuff}")
for elem in stuff:
    html = elem.get_attribute("outerHTML")
    print(f"html:{html}")
I'm trying to use Python (Selenium, BeautifulSoup, and XPath) to scrape a span with an itemprop equal to "description", but every time I run the code, the "try" fails and it prints out the "except" error.
I do see the element in the code when I inspect elements on the page.
Line that isn't getting the desired response:
quick_overview = soup.find_element_by_xpath("//span[contains(@itemprop, 'description')]")
Personally, I think you should just keep working with Selenium:
quick_overview = driver.find_element_by_xpath("//span[contains(@itemprop, 'description')]")
for the element, and add .text at the end to get the text content.
To actually use soup to parse this out, you would likely need a wait condition from Selenium first anyway, so there is no real point.
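Something like this for the Selenium-only route (a minimal sketch that adds an explicit wait before reading .text; WebDriverWait, EC and By are the usual selenium.webdriver.support imports):
wait = WebDriverWait(driver, 10)
span = wait.until(EC.presence_of_element_located(
    (By.XPATH, "//span[contains(@itemprop, 'description')]")))
quick_overview = span.text
print(quick_overview)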
However, should you decide to integrate bs4, you need to change your function to work with the actual HTML from driver.page_source and parse that, then switch to select_one to grab your item. Also ensure you are returning from the function and assigning the result to a new soup object.
from bs4 import BeautifulSoup
from selenium import webdriver  # links w/ browser and carries out actions
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"C:\Program Files (x86)\chromedriver_win32\chromedriver.exe"
baseurl = "http://www.waytekwire.com"
skus_to_find_test = ['WL16-8', 'WG18-12']

driver = webdriver.Chrome(PATH)
driver.get(baseurl)

def use_driver_current_html(driver):
    soup = BeautifulSoup(driver.page_source, 'lxml')
    return soup

for sku in skus_to_find_test:
    search_bar = driver.find_element_by_id('themeSearchText')
    search_bar.send_keys(sku)
    search_bar.send_keys(Keys.RETURN)
    try:
        product_url = driver.find_elements_by_xpath(f"//div[contains(@class, 'itemDescription')]//h3//a[contains(text(), '{sku}')]")[0]
        product_url.click()
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//span[contains(@itemprop, 'description')]")))
        soup = use_driver_current_html(driver)
        try:
            quick_overview = soup.select_one("span[itemprop=description]").text
            print(quick_overview)
        except:
            print('No Quick Overview Found.')
    except:
        print('Product not found.')
I wrote Python code for web-scraping the Sydney Morning Herald newspaper. The code first clicks all the "Show more" buttons and then scrapes all the articles. The Selenium part is working correctly, but I think there is a problem in the scraping part: after scraping the desired fields (date, title, and content) for a few articles (5-6), it only gives the date and title, with no content.
import time
import csv
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
base = 'https://www.smh.com.au'
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get('https://www.smh.com.au/search?text=cybersecurity')
while True:
    try:
        time.sleep(2)
        show_more = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, '_3we9i')))
        show_more.click()
    except Exception as e:
        print(e)
        break

soup = BeautifulSoup(browser.page_source, 'lxml')
anchors = soup.find_all('a', {'tabindex': '-1'})

for anchor in anchors:
    browser.get(base + anchor['href'])
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')

    dateTag = sub_soup.find('time', {'class': '_2_zR-'})
    titleTag = sub_soup.find('h1', {'itemprop': 'headline'})
    contentTag = sub_soup.find_all('div', {'class': '_1665V undefined'})

    date = None
    title = None
    content = None

    if isinstance(dateTag, Tag):
        date = dateTag.get_text().strip()
    if isinstance(titleTag, Tag):
        title = titleTag.get_text().strip()
    if isinstance(contentTag, list):
        content = []
        for c in contentTag:
            content.append(c.get_text().strip())
        content = ' '.join(content)

    print(f'{date}\n {title}\n {content}\n')
    time.sleep(3)

browser.close()
Why does this code stop giving the content part after a few articles? I don't understand it.
Thank you.
It's because you've reached your monthly free access limit.
"You've reached your monthly free access limit" is the message the site displays after a few pages have been viewed, so the article content is no longer in the pages you fetch.
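One way to confirm this (a rough sketch; the exact banner wording is an assumption) is to check the page source for that message before parsing each article:
# Hypothetical paywall check: the exact banner text is an assumption
PAYWALL_TEXT = "You've reached your monthly free access limit"

page = browser.page_source
if PAYWALL_TEXT in page:
    print('Hit the free-article limit; stopping.')
else:
    sub_soup = BeautifulSoup(page, 'html.parser')
    # ... extract date, title and content as before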
I'm using Selenium to click through to the web page I want, and then I parse the page using Beautiful Soup.
Somebody has shown how to get the inner HTML of an element in a Selenium WebDriver. Is there a way to get the HTML of the whole page? Thanks.
Sample code in Python (based on the post above, the language doesn't seem to matter much):
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = 'http://www.google.com'
driver = webdriver.Firefox()
driver.get(url)
the_html = driver---somehow----.get_attribute('innerHTML')
bs = BeautifulSoup(the_html, 'html.parser')
To get the HTML for the whole page:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com")
html = driver.page_source
To get the outer HTML (tag included):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.outerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.outerHTML;")
# HTML from element with some JavaScript
element = driver.find_element_by_css_selector("#hireme")
html = driver.execute_script("return arguments[0].outerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element_by_css_selector("#hireme")
html = element.get_attribute('outerHTML')
To get the inner HTML (tag excluded):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.innerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.innerHTML;")
# HTML from element with some JavaScript
element = driver.find_element_by_css_selector("#hireme")
html = driver.execute_script("return arguments[0].innerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element_by_css_selector("#hireme")
html = element.get_attribute('innerHTML')
driver.page_source is not available in the JavaScript bindings; the following worked for me (Node.js selenium-webdriver):
let html = await driver.getPageSource();
Reference: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/ie_exports_Driver.html#getPageSource
Using a page object in Java:
@FindBy(xpath = "xpath")
private WebElement element;

public String getInnerHtml() {
    System.out.println(waitUntilElementToBeClickable(element, 10).getAttribute("innerHTML"));
    return waitUntilElementToBeClickable(element, 10).getAttribute("innerHTML");
}
A C# snippet for those of us who might want to copy / paste a bit of working code some day
var element = yourWebDriver.FindElement(By.TagName("html"));
string outerHTML = element.GetAttribute(nameof(outerHTML));
Thanks to those who answered before me. Anyone in the future who benefits from this C# snippet, which gets the HTML for any page element in a Selenium test, please consider upvoting this answer or leaving a comment.
I am trying to scrape the following page:
https://www.dukascopy.com/swiss/english/marketwatch/sentiment/
More exactly, the numbers in the chart. For example, the number 74,19 % in the green bar next to the AUD/USD text. I have inspected the elements and found out that the tag for this number is span, but the following code does not return this or any other number in the chart:
import requests
from bs4 import BeautifulSoup
r=requests.get('https://www.dukascopy.com/swiss/english/marketwatch/sentiment/')
soup = BeautifulSoup(r.content, "html.parser")
data = soup('span')
print(data)
If you combine Selenium with Beautiful Soup, you get Selenium's ability to render the page and reach the content inside iframes.
try this:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get(bond_iframe)  # bond_iframe is the URL of the iframe you want to scrape
bond_source = browser.page_source
browser.quit()

soup = BeautifulSoup(bond_source, "html.parser")
for div in soup.findAll('div', attrs={'class': 'qs-note-panel'}):
    print(div)
The for loop iterates over whichever div tags you are searching for.
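If you don't already have the iframe URL, one way to obtain bond_iframe (a rough sketch; it assumes the sentiment chart sits in an iframe on that page and that the first iframe is the right one) is to load the outer page and read the iframe's src attribute:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.dukascopy.com/swiss/english/marketwatch/sentiment/')

# Assumption: the sentiment widget is inside the first <iframe>; grab its src
iframe = browser.find_element_by_tag_name('iframe')
bond_iframe = iframe.get_attribute('src')
print(bond_iframe)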