Problem extracting Google business description - selenium

I have been struggling with this issue for a couple of days and could really use some help. I am trying to scrape Google business information with Python, BeautifulSoup, and Selenium, and I want to extract the business description that is available for some of them:
As you can see, not all of the text is shown, so I need to click “More”. That is where the problem arises: no matter what I do, I can't seem to click it. I tried:
Waiting after I get the URL with Selenium so elements load
Getting the element by class
Getting the element by XPath
Clicking the element via JS-executed code (see the sketch just after this list)
Checking if the element is in an iframe (it seems it is not)
Setting the browser to max size, turning the headless option on and off
Switching between Firefox and Chrome
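For reference, a JS-executed click usually looks something like this minimal sketch (the locator for the “More” link is an assumption, based on the XPath from the answer below):
# Locate the link first, then click it from JavaScript (locator is a guess).
more_link = browser.find_element(By.XPATH, "//a[@href and .='More']")
browser.execute_script("arguments[0].click();", more_link)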
EDIT: Code I tried using:
from urllib.parse import quote
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# company and location are defined earlier in the script
url = 'https://www.google.com/search?q=' + quote(''.join(company) + ' ' + ''.join(location))

# Note: these are Firefox options, despite the Chrome-style argument names.
firefox_options = webdriver.FirefoxOptions()
firefox_options.headless = True
firefox_options.add_argument("--lang=en-GB")
firefox_options.add_argument("--window-size=1100,1000")
firefox_options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')

browser = webdriver.Firefox(executable_path='C:/geckodriver.exe', options=firefox_options)
browser.maximize_window()
wait = WebDriverWait(browser, 10)
browser.get(url)
# '@href' (not '#href') is the correct XPath attribute syntax.
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@href and .='More']"))).click()
# browser.find_element_by_class_name('bJpcZ').click()
html = browser.page_source
browser.close()
soup = BeautifulSoup(html, 'lxml')
If anyone knows a solution, please pass it over :)

driver.maximize_window()
wait = WebDriverWait(driver, 10)
driver.get("https://www.google.com/search?rlz=1C1NDCM_enCA792CA792&sxsrf=APq-WBsY3Q1E1ge_7PuFaovaxQ_Orvk8-w:1645162032562&q=dungeness+pest+control&spell=1&sa=X&ved=2ahUKEwjfuLGUwoj2AhUaHDQIHUtpCOkQBSgAegQIARAy&biw=1366&bih=663&dpr=1") # open the search results page
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@href and .='More']"))).click()
Simply click the a tag with the text 'More'.
There is also a //div[@data-long-text] element, however, where you could just call .get_attribute("data-long-text") instead.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
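If clicking remains flaky, a minimal sketch of the attribute approach might look like this (the explicit wait is an assumption; the //div[@data-long-text] locator comes from the note above):
# Read the full description from the data-long-text attribute instead of clicking "More".
desc_div = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@data-long-text]")))
description = desc_div.get_attribute("data-long-text")
print(description)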

Related

Selenium (Python)- Webscraping verb-conjugation tables (Accessing web elements underneath '#document')

Section 0: Introduction:
This is my first web-scraping project and I am not experienced in using Selenium. I am trying to scrape Arabic verb-conjugation tables from the website:
Online Sarf Generator
Any help with the following problem would be great.
Thank you.
Section 1: The Problem:
I am trying to webscrape from the following website:
Online Sarf Generator
For doing this, I am trying to use Selenium.
I basically need to select the three root letters and the family from the four toggle menus as shown in the picture below:
After this, I have to click the 'Generate Sarf Table' button.
Section 2: My Attempt:
Here is my code:
#------------------ Just setting up the web driver:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

s = Service('/usr/local/bin/chromedriver')
# Set some Selenium Chrome options:
chromeOptions = Options()
# chromeOptions.headless = False
driver = webdriver.Chrome(service=s, options=chromeOptions)
driver.get('https://sites.google.com/view/sarfgenerator/home')
# I switch the frame once:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
# I switch the frame again:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
This takes me to the frame within which the webelements that I need are located.
Now, I print the html to see where I am at:
print(BeautifulSoup(driver.execute_script("return document.body.innerHTML;"),'html.parser'))
Here is the output that I get:
<iframe frameborder="0" id="userHtmlFrame" scrolling="yes">
</iframe>
<script>function loadGapi(){var loaderScript=document.createElement('script');loaderScript.setAttribute('src','https://apis.google.com/js/api.js?checkCookie=1');loaderScript.onload=function(){this.onload=function(){};loadGapiClient();};loaderScript.onreadystatechange=function(){if(this.readyState==='complete'){this.onload();}};(document.head||document.body||document.documentElement).appendChild(loaderScript);}function updateUserHtmlFrame(userHtml,enableInteraction,forceIosScrolling){var frame=document.getElementById('userHtmlFrame');if(enableInteraction){if(forceIosScrolling){var iframeParent=frame.parentElement;iframeParent.classList.add('forceIosScrolling');}else{frame.style.overflow='auto';}}else{frame.setAttribute('scrolling','no');frame.style.pointerEvents='none';}clearCookies();clearStorage();frame.contentWindow.document.open();frame.contentWindow.document.write('<base target="_blank">'+userHtml);frame.contentWindow.document.close();}function onGapiInitialized(){gapi.rpc.call('..','innerFrameGapiInitialized');gapi.rpc.register('updateUserHtmlFrame',updateUserHtmlFrame);}function loadGapiClient(){gapi.load('gapi.rpc',onGapiInitialized);}if(document.readyState=='complete'){loadGapi();}else{self.addEventListener('load',loadGapi);}function clearCookies(){var cookies=document.cookie.split(";");for(var i=0;i<cookies.length;i++){var cookie=cookies[i];var equalPosition=cookie.indexOf("=");var name=equalPosition>-1?cookie.substr(0,equalPosition):cookie;document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:00 GMT";document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:01 GMT ;domain=.googleusercontent.com";}}function clearStorage(){try{localStorage.clear();sessionStorage.clear();}catch(e){}}</script>
However, the actual html on the website looks like this:
Section 3: The main problem with my approach:
I am unable to access anything inside the #document contained within the iframe.
Section 4: Conclusion:
Is there a possible solution that can fix my current approach to the problem?
Is there any other way to solve the problem described in Section 1?
You put a lot of effort into structuring your question, so I couldn't not answer it, even if it meant double negation.
Here is how you can drill down into the iframe with content:
EDIT: here is how you can select some options, click the button and access the results:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://sites.google.com/view/sarfgenerator/home'
driver.get(url)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@aria-label="Custom embed"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="innerFrame"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="userHtmlFrame"]')))
first_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root1"]'))))
second_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root2"]'))))
third_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root3"]'))))
first_select.select_by_visible_text("ج")
second_select.select_by_visible_text("ت")
third_select.select_by_visible_text("ص")
wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@onclick="sarfGenerator(false)"]'))).click()
print('clicked')
result = wait.until(EC.presence_of_element_located((By.XPATH, '//p[@id="demo"]')))
print(result.text)
Result printed in terminal:
clicked
جَتَّصَ يُجَتِّصُ تَجتِيصًا مُجَتِّصٌ
جُتِّصَ يُجَتَّصُ تَجتِيصًا مُجَتَّصٌ
جَتِّصْ لا تُجَتِّصْ مُجَتَّصٌ Highlight Root Letters
The Selenium setup here is for Linux; just note the imports and the part after defining the driver.
Selenium documentation can be found here.

find_element_by_name selenium python question

I'm a beginner trying to use selenium to automate browser interactions through an undetectable chrome browser.
The code I've done so far is below (you have no idea how much time I've sunk into 5 lines).
I've tried so many iterations of the same code that I've lost sanity. This SHOULD work?
This is almost copied exactly from a YouTube video now; there were some other ideas that YouTubers used, but I didn't understand the coding, so I haven't touched them. Anything #'d can be ignored or assumed that I've played with it and failed.
import autogui, sys, time, webbrowser, selenium
import undetected_chromedriver.v2 as uc
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common import action_chains
#Open Browser and visit website.
driver = uc.Chrome()
driver.get('https://www.iqrpg.com/game.html')
time.sleep(5)
#Complete username and password fields
userN = 'Gary'
passW = 'Barry'
find_element_by_name('login_username').send_keys(userN)
#find_element_by_name('login_password').send_keys(passW)
#driver.find_element_by_css_selector("input[type=\"submit"
#userField =
#passField = driver.find_element(By.ID, "passwd-id")
#search_box = driver.find_element_by_name('Battling')
#search_box.send_keys('ChromeDriver')
#search_box.submit()
#time.sleep(5)
1. Expecting the browser to open, forcibly logging you out due to Selenium Chrome detection
2. Select name='login_username' and send the string saved under userN as keys
3. Same for the password
4. Click login (not yet coded, but planned)
In the latest Selenium with Python, you cannot use driver.find_element_by_name.
Instead, you have to use driver.find_element:
from selenium.webdriver.common.by import By
driver.find_element(By.NAME, "login_username").send_keys("userN")
driver.find_element(By.NAME, "login_password").send_keys("passW")

Selenium fails to load elements, despite EC, waits, and scrolling attempts

With Selenium (3.141), BeautifulSoup (4.7.9), and Python (3.7.9), I'm trying to scrape what streaming, rental, and buying options are available for a given movie/show. I've spent hours trying to solve this, so any help would be appreciated. Apologies for the poor formatting, in terms of mixing in comments and prior attempts.
Example Link: https://www.justwatch.com/us/tv-show/24
The desired outcome is a BeautifulSoup element that I can then parse (e.g., which streaming services have it, how many seasons are available, etc.),
which has 3 elements (as of now): Hulu, IMDb TV, and DIRECTV.
I tried numerous variations, but only get one of the 3 streaming services for the example link, and even then it's not a consistent result. Often, I get an empty object.
Some of the things that I've tried include waiting for an expected condition (presence or visibility) and explicitly using sleep() from the time library. I'm using a Mac (but running Linux via a USB), so there is no "PAGE DOWN" on the physical keyboard. With the Keys module, I've tried Control+Arrow Down, Page Down, and Space (the space bar), but on this particular web page they don't work. However, if I'm browsing normally, Control+Arrow Down and the space bar do scroll the desired section into view. As far as I know, there is no Fn+Arrow Down option that works in Keys, but that's another way I can scroll normally.
I've run both headless and regular options to try to debug, as well as trying both Firefox and Chrome drivers.
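A JavaScript-driven scroll is another common workaround when Keys won't move the page (a hedged suggestion, not one of the attempts listed above):
# Scroll to the bottom of the page so lazy-loaded sections render.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")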
Here's my code:
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
firefox_options = Options()
firefox_options.add_argument('--enable-javascript') # double-checking to make sure that javascript is enabled
firefox_options.add_argument('--headless')
firefox_driver_path = 'geckodriver'
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=firefox_options)
url_link = 'https://www.justwatch.com/us/tv-show/24'
driver.get(url_link) # initial page
cookies = driver.get_cookies()
# Examples of things I've tried around this part of the code:
# - various time.sleep(3) and driver.implicitly_wait(3) commands
# - webdriver.ActionChains(driver).key_down(Keys.CONTROL).key_down(Keys.ARROW_DOWN).perform()
# - webdriver.ActionChains(driver).key_down(Keys.SPACE).perform()

# This code yields a timeout error when used.
# (Note: By.CLASS_NAME takes a single class name; a compound, space-separated
# value like the one below will not match, which is one reason the wait times out.
# A CSS selector such as ".price-comparison__grid__row--stream" works instead.)
stream_results = WebDriverWait(driver, 15)
stream_results.until(EC.presence_of_element_located(
    (By.CLASS_NAME, "price-comparison__grid__row price-comparison__grid__row--stream")))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser') # 'lxml' didn't work either
Here's the code for getting the HTML related to the streaming services. I've also tried to grab the HTML at various levels, IDs, and classes of the tree, but the code just isn't there:
stream_row = soup.find('div', attrs={'class':'price-comparison__grid__row price-comparison__grid__row--stream'})
stream_row_holder = soup.find('div', attrs={'class':'price-comparison__grid__row__holder'})
stream_items = stream_row_holder\
.find_all('div', attrs={'class':'price-comparison__grid__row__element__icon'})
driver.quit()
I'm not sure if you are saying your code works in some cases or not at all, but I use Chrome, and the four find_all() lines at the end all produce results. If this isn't what you mean, let me know. The one thing you may be missing is a time.sleep() that is long enough; that could be the only difference.
Note you need chromedriver to run this code, but perhaps you have chrome and can download chromedriver.exe.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://www.justwatch.com/us/tv-show/24'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(class_="price-comparison__grid__row__price")
soup.find_all(class_="price-comparison__grid__row__holder")
soup.find_all(class_="price-comparison__grid__row__element__icon")
soup.find_all(class_="price-comparison__grid__row--stream")
This is the output from the last line:
[<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title price-comparison__promoted__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>,
<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="IMDb TV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/134049674/s100" title="IMDb TV"/><div class="price-comparison__grid__row__price"> 8 Seasons <!-- --></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="DIRECTV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/158260222/s100" title="DIRECTV"/><div class="price-comparison__grid__row__price"> 1 Season <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>]

Selenium can't find search element inside form

I'm trying to use selenium to perform searches in lexisnexis and I can't get it to find the search box.
I've tried find_element_by_* with all possible attributes, and I only get the "NoSuchElementException: Message: no such element: Unable to locate element" error every time.
See screenshot of the inspection tab -- the highlighted part is the element I need
My code:
from selenium import webdriver
import numpy as np
import pandas as pd
searchTerms = r'something'
url = r'https://www.lexisnexis.com/uk/legal/news' # this is the page after login - not including the code for login here.
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)
I tried everything:
browser.find_element_by_id('search-query')
browser.find_element_by_xpath('//*[#id="search-query"]')
browser.find_element_by_xpath('/html/body/div/header/div/form/div[2]/input')
etc..
Nothing works. Any suggestions?
It could be that your site is taking too long to load; in such cases, you can use waits to avoid synchronization issues.
wait = WebDriverWait(driver, 10)
inputBox = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='search-query']")))
Note: add the below imports to your solution:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:

I'm trying to automatically generate lots of users on the webpage kahoot.it using Selenium, to make them appear in front of the class. However, I get this error message when trying to access the inputSession item (where you write the game ID to enter the game):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.kahoot.it")
gameID = driver.find_element_by_id("inputSession")
username = driver.find_element_by_id("username")
gameID.send_keys("53384")
This is the error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:
{"method":"id","selector":"inputSession"}
Looks like it takes time to load the webpage, and hence the web element wasn't detected. You can either use @shri's code above or just add these two statements right below the line driver = webdriver.Firefox():
driver.maximize_window() # For maximizing window
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
It could be a race condition, where the find-element call executes before the element is present on the page. Take a look at the wait timeout documentation. Here is an example from the docs:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
In my case, the error was caused by the element I was looking for being inside an iframe. This meant I had to switch frames before looking for the element:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.google.co.uk/maps")
frame_0 = driver.find_element_by_class_name('widget-consent-frame')
driver.switch_to.frame(frame_0)
agree_btn_0 = driver.find_element_by_id('introAgreeButton')
agree_btn_0.click()
Reddit source
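One follow-up worth noting: after interacting inside the iframe, you can return to the main document before locating further elements:
# Switch back to the top-level document once done inside the iframe.
driver.switch_to.default_content()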
You can also use the following as an alternative to the above two solutions:
import time
time.sleep(30)
This worked for me (the try/finally didn't; it kept hitting the finally and browser.close()):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('mywebsite.com')
username = None
while username is None:
    username = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "username"))
    )
username.send_keys('myusername@email.com')
It seems that your browser did not read the proper HTML text/tags; use a delay so the page loads first, and then get all the tags from the page.
driver = webdriver.Chrome('./chromedriver.exe')
# load the page
driver.get('https://www.instagram.com/accounts/login/')
# use delay function to get all tags
driver.implicitly_wait(20)
# access tag
driver.find_element_by_name('username').send_keys(self.username)
Also, for some, the error may be due to a new tab opening when clicking a button (not in this question particularly). You can then switch tabs with this command:
driver.switch_to.window(driver.window_handles[1])  # switch to the second tab
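To return to the original tab afterwards, the first handle in the list is the original tab:
driver.switch_to.window(driver.window_handles[0])  # back to the first tab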
I had the same problem as you and this solution saved me:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
It just means the function is executing before the button can be clicked. Example solution:
from time import sleep  # sleep comes from the time module, not selenium
# load the page first and then pause
sleep(3)  # pauses execution for 3 seconds before the next line runs