I'm using Selenium to click through to the web page I want, and then parse that page with Beautiful Soup.
Somebody has shown how to get the inner HTML of an element in a Selenium WebDriver. Is there a way to get the HTML of the whole page? Thanks
Sample code in Python (based on the post above, the language doesn't seem to matter much):
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = 'http://www.google.com'
driver = webdriver.Firefox()
driver.get(url)
the_html = driver---somehow----.get_attribute('innerHTML')
bs = BeautifulSoup(the_html, 'html.parser')
To get the HTML for the whole page:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com")
html = driver.page_source
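From there, the source can go straight into Beautiful Soup, which is what the question asks for; a minimal sketch:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)  # e.g. prints the <title> tag of the loaded page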
To get the outer HTML (tag included):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.outerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.outerHTML;")
# HTML from element with some JavaScript
# (Selenium 4 locators; needs: from selenium.webdriver.common.by import By)
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].outerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('outerHTML')
To get the inner HTML (tag excluded):
# HTML from `<html>`
html = driver.execute_script("return document.documentElement.innerHTML;")
# HTML from `<body>`
html = driver.execute_script("return document.body.innerHTML;")
# HTML from element with some JavaScript
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].innerHTML;", element)
# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('innerHTML')
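Putting this together with the question's sketch, the missing piece is just an element lookup before get_attribute; a minimal sketch, using the body element as an arbitrary example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('http://www.google.com')
element = driver.find_element(By.TAG_NAME, 'body')  # any element would do
bs = BeautifulSoup(element.get_attribute('innerHTML'), 'html.parser')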
driver.page_source is the Python API; in the JavaScript bindings, the following worked for me:
let html = await driver.getPageSource();
Reference: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/ie_exports_Driver.html#getPageSource
Using a page object in Java:
@FindBy(xpath = "xpath")
private WebElement element;
public String getInnerHtml() {
    String innerHtml = waitUntilElementToBeClickable(element, 10).getAttribute("innerHTML");
    System.out.println(innerHtml);
    return innerHtml;
}
A C# snippet for those of us who might want to copy/paste a bit of working code some day:
var element = yourWebDriver.FindElement(By.TagName("html"));
string outerHTML = element.GetAttribute(nameof(outerHTML));
Thanks to those who answered before me. If this C# snippet for getting the HTML of any page element in a Selenium test helps you in the future, please consider upvoting this answer or leaving a comment.
Related
I'm using Selenium to scrape a webpage. It finds the elements on the main page, but when I use the click() function, the driver never finds the elements on the new page. I used BeautifulSoup to check whether it's getting the HTML, but the HTML is always from the main page. (The driver window does show that the new page has opened.)
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
I've used WebDriverWait() to check whether the page simply isn't loading, but even after 60 seconds it never does:
element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.ID, "ddlProducto")))
I also tried execute_script() to check whether clicking the button via JavaScript loads the page, but it returns None when I print the variable holding the element from the new page:
selectProducto = driver.execute_script("return document.getElementById('ddlProducto');")
print(selectProducto)
I also used chwd = driver.window_handles and driver.switch_to.window(chwd[1]), but it says that the index is out of range:
chwd = driver.window_handles
driver.switch_to.window(chwd[1])
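A common pattern is to wait for the new handle before switching; a minimal sketch, assuming the click really opens a new window (if it navigates in the same tab, there is only ever one handle, which would explain the index error):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
old_handles = driver.window_handles
# ... perform the click() that should open the new page ...
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(len(old_handles) + 1))
new_handle = [h for h in driver.window_handles if h not in old_handles][0]
driver.switch_to.window(new_handle)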
I want to scrape this site for some of my natural language processing work. I have a subscription to the website, but I still cannot get the result; I get an error that it is unable to locate the element.
The link to the login page is login.
This is the code that I tried, in Python with Selenium:
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-infobars")
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)
driver.get('https://login.newscorpaustralia.com/login?state=hKFo2SBmOXc1TjRJNDlBX3hObkZPN1NsRWgzcktONTlPVnJMS6FupWxvZ2luo3RpZNkgUi1ZRmV2Z2dwcWJmZUpqdWtZdk5CUUllX0h3YngwanSjY2lk2SAwdjlpN0tvVzZNQkxTZmUwMzZZU1FUNzl6QThaYXo0WQ&client=0v9i7KoW6MBLSfe036YSQT79zA8Zaz4Y&protocol=oauth2&response_type=token%20id_token&scope=openid%20profile&audience=newscorpaustralia&site=couriermail&redirect_uri=https%3A%2F%2Fwww.couriermail.com.au%2Fremote%2Fidentity%2Fauth%2Flatest%2Flogin%2Fcallback.html%3FredirectUri%3Dhttps%253A%252F%252Fwww.couriermail.com.au%252Fsearch-results%253Fq%253Djason%252520huny&prevent_sign_up=true&nonce=7j4grLXRD39EVhGsxcagsO5c-PtAY4Md&auth0Client=eyJuYW1lIjoiYXV0aDAuanMiLCJ2ZXJzaW9uIjoiOS4xOS4wIn0%3D')
time.sleep(10)
elem = driver.find_element(by=By.CLASS_NAME,value='navigation_search')
username = driver.find_element(by=By.ID,value='1-email')
password = driver.find_element(by=By.NAME,value='password')
login = driver.find_element(by=By.NAME,value='submit')
username.send_keys("myid");
password.send_keys("password");
login.click();
time.sleep(20)
soup = BeautifulSoup(driver.page_source, 'html.parser')
search = driver.find_element(by=By.CSS_SELECTOR,value='form.navigation_search')
search.click();
search.send_keys("jason hunt");
print(driver.page_source)
Below is the error that I am getting. I want to grab the search icon and send keys to it, but I cannot find the search form after login.
Below is the text-based HTML of the element.
I tried printing the page source, and I could not locate the HTML element there either.
Not a proper answer, but since you can't add formatting to comments and this has the same desired effect:
driver.get("https://www.couriermail.com.au/search-results");
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "search_box_input"))
searchBox = driver.find_element(By.CLASS_NAME, "search_box_input")
searchBox.send_keys("test");
I wrote a script to scrape data from SubGraph APIs. It simply clicks the run button and grabs some output code. The problem is that I have to scroll to the end of the page to get the full output code; otherwise it comes back cut off. This is what I tried:
def find_datasets():
    datasets_url = []
    s = Service(ChromeDriverManager().install())
    options = Options()
    options.headless = False
    options.add_argument('window-size=800,600')
    driver = webdriver.Chrome(service=s, options=options)
    driver.get("https://v4.subgraph.polygon.oceanprotocol.com/subgraphs/name/oceanprotocol/ocean-subgraph/graphql?query=%7B%0A%20%20pools(orderBy%3A%20createdTimestamp%2C%20orderDirection%3A%20desc)%20%7B%0A%20%20%20%20id%0A%20%20%20%20datatoken%20%7B%0A%20%20%20%20%20%20address%0A%20%20%20%20%7D%0A%20%20%20%20publishMarketSwapFee%0A%20%20%20%20liquidityProviderSwapFee%0A%20%20%7D%0A%7D%0A")
    sleep(15)
    driver.find_element(by=By.XPATH, value="//button[contains(@class, 'execute-button')]").click()
    sleep(8)
    element = driver.find_elements(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    print(driver.find_element(by=By.XPATH, value="//div[contains(@class, 'result-window')]").text)
    driver.get_screenshot_as_file("screenshot.png")
What am I missing? Thank you for your patience.
Have you tried the ActionChains class?
Example:
from selenium.webdriver.common.action_chains import ActionChains

menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")
actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()
Or, I think, you could try simply calling
driver.execute_script("arguments[0].scrollIntoView();", element)
before acting on the target element; you need to invoke scrollIntoView() first.
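Note also that in the question's code, find_elements (plural) returns a list, so passing it straight to execute_script will not scroll anything; a minimal sketch of the scroll step with a single element, reusing the question's selector:
target = driver.find_element(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
driver.execute_script("arguments[0].scrollIntoView(true);", target)  # scroll the result pane into view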
I want the following field, the "514" id (the id is located in the first row of this webpage).
I tried using XPath with the class name and then get_attribute, but that prints blank.
Here is a screenshot of the tag in question.
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.abstractsonline.com/pp8/#!/10517/sessions/#timeSlot=Apr08/1')
page_source = driver.page_source
element = driver.find_elements_by_xpath('.//li[@class="result clearfix"]')
for el in element:
    id = el.find_element_by_class_name('name').get_attribute("data-id")
    print(id)
You can use a single find:
by CSS - .result.clearfix .name
by XPath - .//*[@class='result clearfix']//*[@class='name']
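A minimal sketch of the CSS variant above, assuming Selenium 4-style locators on the same page:
from selenium.webdriver.common.by import By
for name_el in driver.find_elements(By.CSS_SELECTOR, ".result.clearfix .name"):
    print(name_el.get_attribute("data-id"))  # the "514"-style id from the question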
I'm trying to crawl one particular part of this page. Sometimes Selenium gets that part, but sometimes it doesn't, and I can't figure out why. My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(r'C:(my path)\chromedriver.exe')
url = 'https://www.rocketpunch.com/companies?page=1'
driver.get(url)
driver.implicitly_wait(5)
html = driver.page_source
print(html)
soup = BeautifulSoup(html, 'lxml')
comments = soup.findAll('h4', {'class': 'header name'})
for comment in comments:
    print(comment)
After the driver.implicitly_wait(5), just add a short fixed delay like time.sleep(1).
implicitly_wait waits until some element is found; it doesn't wait for all the elements, or for the entire page, to be loaded.
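Alternatively, an explicit wait on the elements you actually parse avoids a fixed sleep; a minimal sketch using the question's h4.header.name elements:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h4.header.name"))
)
html = driver.page_source  # now the list items should be present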