Selenium (Python)- Webscraping verb-conjugation tables (Accessing web elements underneath '#document') - selenium

Section 0: Introduction:
This is my first web-scraping project and I am not experienced with Selenium. I am trying to scrape Arabic verb-conjugation tables from the website:
Online Sarf Generator
Any help with the following problem would be great.
Thank you.
Section 1: The Problem:
I am trying to webscrape from the following website:
Online Sarf Generator
For doing this, I am trying to use Selenium.
I basically need to select the three root letters and the family from the four toggle menus as shown in the picture below:
After this, I have to click the 'Generate Sarf Table' button.
Section 2: My Attempt:
Here is my code:
#------------------ Just Setting Up the web_driver:
s = Service('/usr/local/bin/chromedriver')
# Set some selenium chrome options:
chromeOptions = Options()
# chromeOptions.headless = False
driver = webdriver.Chrome(service=s, options=chromeOptions)
driver.get('https://sites.google.com/view/sarfgenerator/home')
# I switch the frame once:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
# I switch the frame again:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
This takes me to the frame within which the webelements that I need are located.
Now, I print the html to see where I am at:
print(BeautifulSoup(driver.execute_script("return document.body.innerHTML;"),'html.parser'))
Here is the output that I get:
<iframe frameborder="0" id="userHtmlFrame" scrolling="yes">
</iframe>
<script>function loadGapi(){var loaderScript=document.createElement('script');loaderScript.setAttribute('src','https://apis.google.com/js/api.js?checkCookie=1');loaderScript.onload=function(){this.onload=function(){};loadGapiClient();};loaderScript.onreadystatechange=function(){if(this.readyState==='complete'){this.onload();}};(document.head||document.body||document.documentElement).appendChild(loaderScript);}function updateUserHtmlFrame(userHtml,enableInteraction,forceIosScrolling){var frame=document.getElementById('userHtmlFrame');if(enableInteraction){if(forceIosScrolling){var iframeParent=frame.parentElement;iframeParent.classList.add('forceIosScrolling');}else{frame.style.overflow='auto';}}else{frame.setAttribute('scrolling','no');frame.style.pointerEvents='none';}clearCookies();clearStorage();frame.contentWindow.document.open();frame.contentWindow.document.write('<base target="_blank">'+userHtml);frame.contentWindow.document.close();}function onGapiInitialized(){gapi.rpc.call('..','innerFrameGapiInitialized');gapi.rpc.register('updateUserHtmlFrame',updateUserHtmlFrame);}function loadGapiClient(){gapi.load('gapi.rpc',onGapiInitialized);}if(document.readyState=='complete'){loadGapi();}else{self.addEventListener('load',loadGapi);}function clearCookies(){var cookies=document.cookie.split(";");for(var i=0;i<cookies.length;i++){var cookie=cookies[i];var equalPosition=cookie.indexOf("=");var name=equalPosition>-1?cookie.substr(0,equalPosition):cookie;document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:00 GMT";document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:01 GMT ;domain=.googleusercontent.com";}}function clearStorage(){try{localStorage.clear();sessionStorage.clear();}catch(e){}}</script>
However, the actual html on the website looks like this:
Section 3: The main problem with my approach:
I am unable to access anything under the #document node contained within the iframe.
Section 4: Conclusion:
Is there a possible solution that can fix my current approach to the problem?
Is there any other way to solve the problem described in Section 1?

You put a lot of effort into structuring your question, so I couldn't not answer it, even if it meant double negation.
Here is how you can drill down into the iframe with content:
EDIT: here is how you can select some options, click the button and access the results:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://sites.google.com/view/sarfgenerator/home'
driver.get(url)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@aria-label="Custom embed"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="innerFrame"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="userHtmlFrame"]')))
first_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root1"]'))))
second_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root2"]'))))
third_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root3"]'))))
first_select.select_by_visible_text("ج")
second_select.select_by_visible_text("ت")
third_select.select_by_visible_text("ص")
wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@onclick="sarfGenerator(false)"]'))).click()
print('clicked')
result = wait.until(EC.presence_of_element_located((By.XPATH, '//p[@id="demo"]')))
print(result.text)
Result printed in terminal:
clicked
جَتَّصَ يُجَتِّصُ تَجتِيصًا مُجَتِّصٌ
جُتِّصَ يُجَتَّصُ تَجتِيصًا مُجَتَّصٌ
جَتِّصْ لا تُجَتِّصْ مُجَتَّصٌ Highlight Root Letters
The Selenium setup shown is for Linux. On another platform, keep the imports and everything after the driver is defined, and adjust only the driver setup.
Selenium documentation can be found here.
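The three frame switches in the answer can also be folded into a small helper. This is only a sketch under a few assumptions: `switch_into_frames` and `SARF_FRAMES` are hypothetical names, the driver is assumed to be a Selenium 4 WebDriver, and the locator strategy is passed as the plain string "xpath", which is the value behind `By.XPATH`. The helper does not wait for anything, so on a slow page you would still want the explicit waits from the answer.

```python
def switch_into_frames(driver, frame_locators):
    """Enter nested iframes one after another, outermost first.

    frame_locators is a sequence of (by, value) pairs, for example
    ("xpath", '//*[@id="innerFrame"]'). The string "xpath" is the value
    of selenium's By.XPATH constant.
    """
    # Start from the top-level document so the chain is predictable.
    driver.switch_to.default_content()
    for by, value in frame_locators:
        frame = driver.find_element(by, value)
        driver.switch_to.frame(frame)


# Frame chain used in the answer (hypothetical constant name).
SARF_FRAMES = [
    ("xpath", '//*[@aria-label="Custom embed"]'),
    ("xpath", '//*[@id="innerFrame"]'),
    ("xpath", '//*[@id="userHtmlFrame"]'),
]
```

With a real driver this would be called as `switch_into_frames(driver, SARF_FRAMES)` right after `driver.get(url)`. Calling `driver.switch_to.default_content()` afterwards returns to the outer page.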

Related

Unable to find an exact match for CDP version 109, so returning the closest version found: 108 [duplicate]

Here is my code:
from selenium import webdriver
user = "someemail@email.com"
browser = webdriver.Chrome("/path/to/browser/")
browser.get("https://www.quora.com/")
username = browser.find_element_by_name("email")
browser.implicitly_wait(10)
username.send_keys(user)
Here is the error message:
selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable
I think there is another thread with a similar issue. Either the solutions in that thread didn't work for me or I don't know how to implement the solutions.
find_element_by_name("email")
is present multiple times in DOM. So that wouldn't work.
You can try with this css selector :
input[class*='header_login_text_box'][name='email']
Code :
username = browser.find_element_by_css_selector("input[class*='header_login_text_box'][name='email']")
username.send_keys("user@gmail.com")
To send a character sequence to the Email field within Login section of Quora you need to induce WebDriverWait for the element to be clickable and you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.quora.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='title login_title' and text()='Login']//following::div[1]//input[@class='text header_login_text_box ignore_interaction']"))).send_keys("someemail@email.com")
Browser Snapshot:
As said in the comments, the locator you used returns two elements, and the one you need is the second. The driver tries to interact with the first element, so the exception is thrown.
It is a good idea to check in the browser console whether a locator returns the required element:
> $$("[name='email']") (2) [input#__w2_wD9e9Qgz12_email.text, input#__w2_wD9e9Qgz18_email.text.header_login_text_box.ignore_interaction]
> 0: input#__w2_wD9e9Qgz12_email.text 1:
> input#__w2_wD9e9Qgz18_email.text.header_login_text_box.ignore_interaction
> length: 2
> __proto__: Array(0)
Use a different locator. If you cannot figure one out, leave a comment and I will help.
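When a locator matches several elements and the first one is hidden, one option is to pick the first match that is actually displayed. The helper below is a minimal sketch, not part of Selenium: `first_displayed` is a hypothetical name, and it relies only on `find_elements` and `WebElement.is_displayed()`. Being displayed does not guarantee that an element is interactable, so treat this as a first filter rather than a complete fix.

```python
def first_displayed(driver, by, value):
    """Return the first element matching the locator that is visible, or None."""
    for element in driver.find_elements(by, value):
        if element.is_displayed():
            return element
    return None
```

For the case above this might look like `first_displayed(browser, "name", "email")`, where "name" is the string behind `By.NAME`. Check the result for `None` before calling `send_keys` on it.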
from selenium import webdriver
user = "someemail@email.com"
browser = webdriver.Chrome("/path/to/browser/")
browser.get("https://www.quora.com/")
username = browser.find_element_by_xpath("//input[@class='text header_login_text_box ignore_interaction' and @type='text']")
browser.implicitly_wait(10)
username.send_keys(user)
Here you can find why ElementNotInteractableException occurs.
If you are using the Select approach like:
from selenium.webdriver.support.select import Select
try this
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '''//*[@id="ReportViewer1_ctl04_ctl07_ddValue"]''')))).select_by_visible_text(str(x))

There is an element, why am I getting an NoSuchElementException?

problem:
Given the following code:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
# extract the data
elem = browser.find_element_by_xpath("//*[@id='faRVArcVR1a2']/table[2]/tbody/tr[2]/td[6]")
print(elem.text)
I wrote this code to extract financial data.
First, I navigate to the page that has the data I want.
Then I copied the XPath using Chrome's developer tools.
But although the text is there, I get a NoSuchElementException.
Why does this happen?
try to fix:
At first, I wondered whether this happens because of a loading delay.
Although there is almost no delay on my computer, I tried to fix it anyway.
I added some imports and changed the 'elem' part:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
# extract the data
elem = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="faRVArcVR1a2"]/table[2]/tbody/tr[2]/td[6]')))
print(elem.text)
But as a result, I only get a TimeoutException.
I don't know why these problems happen. Please help, thank you!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
elementXpath = '//table[@summary="IFRS연결 연간 재무 정보를 제공합니다."][2]/tbody/tr[2]/td[6]'
# extract the data
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, elementXpath)))
# Wait for the table to load
time.sleep(1)
elem = browser.find_element(by=By.XPATH, value=elementXpath)
print(elem.text)
There were several problems:
The ID of the div that wraps the table ("faRVArcVR1a2") changes every time you load the page, so it is not a reliable way to find the element. I changed the XPath so that the table is found by its summary attribute instead.
The code does not use the value returned by WebDriverWait(...).until(...). With presence_of_element_located, until() does return the located element, but here the element is looked up again with find_element after the extra pause.
Even after the table has appeared, you have to wait an additional second so that all of its cells load. Otherwise you would get an empty string.
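Instead of a fixed one-second pause, one could poll until the cell actually contains text. The function below is a rough sketch of that idea using only the standard library and duck typing: `wait_for_text` is a hypothetical helper, not a Selenium API, and the timeout and poll values are arbitrary. With a real driver, an element could go stale between the lookup and reading `.text`, which this sketch does not handle.

```python
import time


def wait_for_text(driver, by, value, timeout=10.0, poll=0.25):
    """Poll until the first matching element has non-empty text, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(by, value)
        if elements:
            text = elements[0].text.strip()
            if text:
                return text
        # Give up once the deadline has passed without any text showing up.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no non-empty text for {value!r} within {timeout}s")
        time.sleep(poll)
```

A call such as `wait_for_text(browser, "xpath", elementXpath)` would then replace both the `time.sleep(1)` and the second lookup. The same check can also be handed to Selenium's own `WebDriverWait(browser, 10).until(...)` as a callable that takes the driver and returns the text or an empty string.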

Selenium fails to load elements, despite EC, waits, and scrolling attempts

With Selenium (3.141), BeautifulSoup (4.7.9), and Python (3.79), I'm trying to scrape which streaming, rental, and buying options are available for a given movie/show. I've spent hours trying to solve this, so any help would be appreciated. Apologies for the poor formatting, in terms of mixing in comments and prior attempts.
Example Link: https://www.justwatch.com/us/tv-show/24
Desired Outcome is a Beautiful soup element that I can then parse (e.g., which streaming services have it, how many seasons are available, etc.),
which has 3 elements (as of now) - Hulu, IMDB TV, and DirecTV.
I tried numerous variations, but only get one of the 3 streaming services for the example link, and even then it's not a consistent result. Often, I get an empty object.
Some of the things that I've tried include waiting for an expected condition (presence or visibility) and explicitly using sleep() from the time library. I'm using a Mac (but running Linux via a USB), so there is no "PAGE DOWN" on the physical keyboard. For the keys module, I've tried control+arrow down, page down, and space (space bar), but on this particular web page they don't work. However, if I'm browsing it in a normal fashion, control+arrow down and space bar help scroll the desired section into view. As far as I know, there is no fn + arrow down option that works in Keys, but that's another way that I can move in a normal fashion.
I've run both headless and regular options to try to debug, as well as trying both Firefox and Chrome drivers.
Here's my code:
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
firefox_options = Options()
firefox_options.add_argument('--enable-javascript') # double-checking to make sure that javascript is enabled
firefox_options.add_argument('--headless')
firefox_driver_path = 'geckodriver'
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=firefox_options)
url_link = 'https://www.justwatch.com/us/tv-show/24'
driver.get(url_link) # initial page
cookies = driver.get_cookies()
Examples of things I've tried around this part of the code
various time.sleep(3) and driver.implicitly_wait(3) commands
webdriver.ActionChains(driver).key_down(Keys.CONTROL).key_down(Keys.ARROW_DOWN).perform()
webdriver.ActionChains(driver).key_down(Keys.SPACE).perform()
This code yields a timeout error when used
stream_results = WebDriverWait(driver, 15)
stream_results.until(EC.presence_of_element_located(
(By.CLASS_NAME, "price-comparison__grid__row price-comparison__grid__row--stream")))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser') # 'lxml' didn't work either
Here's code for getting the html related to the streaming services. I've also tried to grab the html code at various levels, ids, and classes of the tree, but the code just isn't there
stream_row = soup.find('div', attrs={'class':'price-comparison__grid__row price-comparison__grid__row--stream'})
stream_row_holder = soup.find('div', attrs={'class':'price-comparison__grid__row__holder'})
stream_items = stream_row_holder\
.find_all('div', attrs={'class':'price-comparison__grid__row__element__icon'})
driver.quit()
I'm not sure if you are saying your code works in some cases or not at all, but I use chrome and the four find_all() lines at the end all produce results. If this isn't what you mean, let me know. The one thing you may be missing is a time.sleep() that is long enough. That could be the only difference...
Note you need chromedriver to run this code, but perhaps you have chrome and can download chromedriver.exe.
import time
from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://www.justwatch.com/us/tv-show/24'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(class_="price-comparison__grid__row__price")
soup.find_all(class_="price-comparison__grid__row__holder")
soup.find_all(class_="price-comparison__grid__row__element__icon")
soup.find_all(class_="price-comparison__grid__row--stream")
This is the output from the last line:
[<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title price-comparison__promoted__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>,
<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="IMDb TV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/134049674/s100" title="IMDb TV"/><div class="price-comparison__grid__row__price"> 8 Seasons <!-- --></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="DIRECTV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/158260222/s100" title="DIRECTV"/><div class="price-comparison__grid__row__price"> 1 Season <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>]
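A likely reason for the timeout in the question's explicit wait is that `By.CLASS_NAME` expects a single class name, while the code passes two names separated by a space. A CSS selector can require both classes at once. The helper below is a small sketch of that conversion: `classes_to_css` is a hypothetical name, and it assumes the class names need no CSS escaping, which holds for the hyphen and underscore names on this page.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector requiring all classes."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("expected at least one class name")
    # ".a.b" matches elements that carry both class a and class b.
    return tag + "".join("." + name for name in classes)


STREAM_ROW = classes_to_css(
    "price-comparison__grid__row price-comparison__grid__row--stream", tag="div"
)
```

The resulting string could then be used as `EC.presence_of_element_located((By.CSS_SELECTOR, STREAM_ROW))` in place of the `By.CLASS_NAME` locator.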

Selenium can't find search element inside form

I'm trying to use selenium to perform searches in lexisnexis and I can't get it to find the search box.
I've tried find_element_by using all possible attributes and I only get the "NoSuchElementException: Message: no such element: Unable to locate element: " error every time.
See screenshot of the inspection tab -- the highlighted part is the element I need
My code:
from selenium import webdriver
import numpy as np
import pandas as pd
searchTerms = r'something'
url = r'https://www.lexisnexis.com/uk/legal/news' # this is the page after login - not including the code for login here.
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)
I tried everything:
browser.find_element_by_id('search-query')
browser.find_element_by_xpath('//*[@id="search-query"]')
browser.find_element_by_xpath('/html/body/div/header/div/form/div[2]/input')
etc..
Nothing works. Any suggestions?
It could be that your site is taking too long to load; in such cases you can use waits to avoid synchronization issues.
wait = WebDriverWait(browser, 10)
inputBox = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='search-query']")))
Note: add the imports below to your solution.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:

I'm trying to automatically generate lots of users on the webpage kahoot.it using selenium to make them appear in front of the class, however, I get this error message when trying to access the inputSession item (where you write the gameID to enter the game)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.kahoot.it")
gameID = driver.find_element_by_id("inputSession")
username = driver.find_element_by_id("username")
gameID.send_keys("53384")
This is the error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:
{"method":"id","selector":"inputSession"}
It looks like the page takes time to load, so the web element was not yet there to be detected. You can either use @shri's code above or add these two statements just below the line driver = webdriver.Firefox():
driver.maximize_window() # For maximizing window
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
This could be a race condition where find_element runs before the element is present on the page. Take a look at the documentation on waits. Here is an example from the docs:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
In my case, the error was caused by the element I was looking for being inside an iframe. This meant I had to change frame before looking for the element:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.google.co.uk/maps")
frame_0 = driver.find_element_by_class_name('widget-consent-frame')
driver.switch_to.frame(frame_0)
agree_btn_0 = driver.find_element_by_id('introAgreeButton')
agree_btn_0.click()
Reddit source
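If you do not know in advance whether the element sits in the main document or in an iframe, a simple search over the top-level frames can help while debugging. This is only a sketch: `find_in_frames` is a hypothetical helper, it looks just one level deep (no nested iframes), and it uses the plain strings "tag name" and "id", which are the values behind `By.TAG_NAME` and `By.ID`.

```python
def find_in_frames(driver, by, value):
    """Look for an element in the main document, then in each top-level iframe.

    Returns the first match, leaving the driver switched into the frame
    that contains it, or None with the driver back at the main document.
    """
    driver.switch_to.default_content()
    found = driver.find_elements(by, value)
    if found:
        return found[0]
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        found = driver.find_elements(by, value)
        if found:
            return found[0]
        # Not in this frame: go back to the top before trying the next one.
        driver.switch_to.default_content()
    return None
```

For the example above this would be `find_in_frames(driver, "id", "introAgreeButton")`. Once you know which frame holds the element, switching to that frame explicitly is the clearer option.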
You can also use below as an alternative to the above two solutions:
import time
time.sleep(30)
This worked for me (the try/finally version didn't; it kept hitting the finally block and closing the browser):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('mywebsite.com')
username = None
while username is None:
    username = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "username"))
    )
username.send_keys('myusername@email.com')
It seems that your browser had not finished loading the HTML when you queried it. Use a wait so that the page loads first, and then look up the tags on the page.
driver = webdriver.Chrome('./chromedriver.exe')
# load the page
driver.get('https://www.instagram.com/accounts/login/')
# use delay function to get all tags
driver.implicitly_wait(20)
# access tag
driver.find_element_by_name('username').send_keys(self.username)
For some, the error may also be caused by a button click opening a new tab (not in this question particularly). In that case you can switch tabs with this command:
driver.switch_to.window(driver.window_handles[1]) #for switching to second tab
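Indexing `window_handles[1]` assumes the new tab already exists and is the second handle. A slightly more defensive variant remembers the handles before the click and waits for one that was not there. The function below is a sketch of that idea: `switch_to_new_window` is a hypothetical helper, and the timeout and poll interval are arbitrary.

```python
import time


def switch_to_new_window(driver, known_handles, timeout=10.0, poll=0.25):
    """Wait for a window handle that is not in known_handles and switch to it."""
    known = set(known_handles)
    deadline = time.monotonic() + timeout
    while True:
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            driver.switch_to.window(new_handles[0])
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no new window or tab appeared")
        time.sleep(poll)
```

Typical use would be to store `before = driver.window_handles`, click the button, and then call `switch_to_new_window(driver, before)`.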
I had the same problem as you and this solution saved me:
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
It just means the lookup is executing before the button can be clicked. Example solution:
from time import sleep
# load the page first and then pause
sleep(3)
# pauses executing the next line for 3 seconds