I'm using Selenium to click through to the web page I want, and then parse that page with Beautiful Soup.
Somebody has shown how to get the inner HTML of an element in a Selenium WebDriver. Is there a way to get the HTML of the whole page? Thanks
Sample code in Python (based on the post above, the language doesn't seem to matter much):
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = 'http://www.google.com'
driver = webdriver.Firefox()
driver.get(url)
the_html = driver---somehow----.get_attribute('innerHTML')
bs = BeautifulSoup(the_html, 'html.parser')
To get the HTML for the whole page:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com")
html = driver.page_source
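From there, the source can go straight into Beautiful Soup, which is what the question asks for; a minimal sketch:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)  # e.g. prints the <title> tag of the loaded page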
To get the outer HTML (tag included):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.outerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.outerHTML;")
# HTML from element with some JavaScript
# (Selenium 4 locators; needs: from selenium.webdriver.common.by import By)
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].outerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('outerHTML')
To get the inner HTML (tag excluded):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.innerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.innerHTML;")
# HTML from element with some JavaScript
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].innerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('innerHTML')
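Putting this together with the question's sketch, the missing piece is just an element lookup before get_attribute; a minimal sketch, using the body element as an arbitrary example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('http://www.google.com')
element = driver.find_element(By.TAG_NAME, 'body')  # any element would do
bs = BeautifulSoup(element.get_attribute('innerHTML'), 'html.parser')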
driver.page_source is the Python API; in the JavaScript bindings, the following worked for me:
let html = await driver.getPageSource();
Reference: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/ie_exports_Driver.html#getPageSource
Using a page object in Java:
@FindBy(xpath = "xpath")
private WebElement element;
public String getInnerHtml() {
    String innerHtml = waitUntilElementToBeClickable(element, 10).getAttribute("innerHTML");
    System.out.println(innerHtml);
    return innerHtml;
}
A C# snippet for those of us who might want to copy/paste a bit of working code some day:
var element = yourWebDriver.FindElement(By.TagName("html"));
string outerHTML = element.GetAttribute(nameof(outerHTML));
Thanks to those who answered before me. If this C# snippet for getting the HTML of any page element in a Selenium test helps you in the future, please consider upvoting this answer or leaving a comment.
Related
I'm using Selenium to scrape a webpage. It finds the elements on the main page, but when I use the click() function, the driver never finds the elements on the new page. I used BeautifulSoup to check whether it's getting the HTML, but the HTML is always from the main page. (The driver window does show that the new page has opened.)
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
I've used WebDriverWait() to check whether the page simply isn't loading, but even after 60 seconds it never does:
element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.ID, "ddlProducto")))
I also tried execute_script() to check whether clicking the button via JavaScript loads the page, but it returns None when I print the variable holding the element from the new page:
selectProducto = driver.execute_script("return document.getElementById('ddlProducto');")
print(selectProducto)
I also used chwd = driver.window_handles and driver.switch_to.window(chwd[1]), but it says that the index is out of range:
chwd = driver.window_handles
driver.switch_to.window(chwd[1])
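A common pattern is to wait for the new handle before switching; a minimal sketch, assuming the click really opens a new window (if it navigates in the same tab, there is only ever one handle, which would explain the index error):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
old_handles = driver.window_handles
# ... perform the click() that should open the new page ...
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(len(old_handles) + 1))
new_handle = [h for h in driver.window_handles if h not in old_handles][0]
driver.switch_to.window(new_handle)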
I want to scrape this site for some of my natural language processing work. I have a subscription to the website, but I still cannot get the result; I get an error that it is unable to locate the element.
The link to the login page is login.
This is the code that I tried, in Python with Selenium:
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-infobars")
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)
driver.get('https://login.newscorpaustralia.com/login?state=hKFo2SBmOXc1TjRJNDlBX3hObkZPN1NsRWgzcktONTlPVnJMS6FupWxvZ2luo3RpZNkgUi1ZRmV2Z2dwcWJmZUpqdWtZdk5CUUllX0h3YngwanSjY2lk2SAwdjlpN0tvVzZNQkxTZmUwMzZZU1FUNzl6QThaYXo0WQ&client=0v9i7KoW6MBLSfe036YSQT79zA8Zaz4Y&protocol=oauth2&response_type=token%20id_token&scope=openid%20profile&audience=newscorpaustralia&site=couriermail&redirect_uri=https%3A%2F%2Fwww.couriermail.com.au%2Fremote%2Fidentity%2Fauth%2Flatest%2Flogin%2Fcallback.html%3FredirectUri%3Dhttps%253A%252F%252Fwww.couriermail.com.au%252Fsearch-results%253Fq%253Djason%252520huny&prevent_sign_up=true&nonce=7j4grLXRD39EVhGsxcagsO5c-PtAY4Md&auth0Client=eyJuYW1lIjoiYXV0aDAuanMiLCJ2ZXJzaW9uIjoiOS4xOS4wIn0%3D')
time.sleep(10)
elem = driver.find_element(by=By.CLASS_NAME,value='navigation_search')
username = driver.find_element(by=By.ID,value='1-email')
password = driver.find_element(by=By.NAME,value='password')
login = driver.find_element(by=By.NAME,value='submit')
username.send_keys("myid");
password.send_keys("password");
login.click();
time.sleep(20)
soup = BeautifulSoup(driver.page_source, 'html.parser')
search = driver.find_element(by=By.CSS_SELECTOR,value='form.navigation_search')
search.click();
search.send_keys("jason hunt");
print(driver.page_source)
Below is the error that I am getting. I want to grab the search icon and send keys to it, but I cannot find the search form after login.
Below is the text-based HTML of the element.
I tried printing the page source, and I could not locate the HTML element there either.
Not a proper answer, but since you can't add formatting to comments and this has the same desired effect:
driver.get("https://www.couriermail.com.au/search-results");
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "search_box_input"))
searchBox = driver.find_element(By.CLASS_NAME, "search_box_input")
searchBox.send_keys("test");
I wrote a script to scrape data from SubGraph APIs. It simply clicks the run button and grabs some output code. The problem is that I have to scroll to the end of the page to get the full output code; otherwise it comes back cut off. This is what I tried:
def find_datasets():
    datasets_url = []
    s = Service(ChromeDriverManager().install())
    options = Options()
    options.headless = False
    options.add_argument('window-size=800,600')
    driver = webdriver.Chrome(service=s, options=options)
    driver.get("https://v4.subgraph.polygon.oceanprotocol.com/subgraphs/name/oceanprotocol/ocean-subgraph/graphql?query=%7B%0A%20%20pools(orderBy%3A%20createdTimestamp%2C%20orderDirection%3A%20desc)%20%7B%0A%20%20%20%20id%0A%20%20%20%20datatoken%20%7B%0A%20%20%20%20%20%20address%0A%20%20%20%20%7D%0A%20%20%20%20publishMarketSwapFee%0A%20%20%20%20liquidityProviderSwapFee%0A%20%20%7D%0A%7D%0A")
    sleep(15)
    driver.find_element(by=By.XPATH, value="//button[contains(@class, 'execute-button')]").click()
    sleep(8)
    element = driver.find_elements(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    print(driver.find_element(by=By.XPATH, value="//div[contains(@class, 'result-window')]").text)
    driver.get_screenshot_as_file("screenshot.png")
What am I missing? Thank you for your patience.
Have you tried the ActionChains class?
Example:
from selenium.webdriver.common.action_chains import ActionChains

menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")
actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()
Or, I think, you could try simply calling
driver.execute_script("arguments[0].scrollIntoView();", element)
before acting on the target element; you need to invoke scrollIntoView() first.
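Note also that in the question's code, find_elements (plural) returns a list, so passing it straight to execute_script will not scroll anything; a minimal sketch of the scroll step with a single element, reusing the question's selector:
target = driver.find_element(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
driver.execute_script("arguments[0].scrollIntoView(true);", target)  # scroll the result pane into view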
I want the following field, the "514" id (the id is located in the first row of this webpage).
I tried using XPath with the class name and then get_attribute, but that prints blank.
Here is a screenshot of the tag in question.
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.abstractsonline.com/pp8/#!/10517/sessions/#timeSlot=Apr08/1')
page_source = driver.page_source
element = driver.find_elements_by_xpath('.//li[@class="result clearfix"]')
for el in element:
    id = el.find_element_by_class_name('name').get_attribute("data-id")
    print(id)
You can use a single find:
by CSS - .result.clearfix .name
by XPath - .//*[@class='result clearfix']//*[@class='name']
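A minimal sketch of the CSS variant above, assuming Selenium 4-style locators on the same page:
from selenium.webdriver.common.by import By
for name_el in driver.find_elements(By.CSS_SELECTOR, ".result.clearfix .name"):
    print(name_el.get_attribute("data-id"))  # the "514"-style id from the question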
I'm trying to crawl one particular part of this page. Sometimes Selenium gets that part, but sometimes it doesn't, and I can't figure out why. My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(r'C:(my path)\chromedriver.exe')
url = 'https://www.rocketpunch.com/companies?page=1'
driver.get(url)
driver.implicitly_wait(5)
html = driver.page_source
print(html)
soup = BeautifulSoup(html, 'lxml')
comments = soup.findAll('h4', {'class': 'header name'})
for comment in comments:
    print(comment)
After the driver.implicitly_wait(5), just add a short fixed delay like time.sleep(1).
implicitly_wait waits until some element is found; it doesn't wait for all the elements, or for the entire page, to be loaded.
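Alternatively, an explicit wait on the elements you actually parse avoids a fixed sleep; a minimal sketch using the question's h4.header.name elements:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h4.header.name"))
)
html = driver.page_source  # now the list items should be present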