BeautifulSoup sometimes can't get the full page source - Selenium

I'm using Selenium and BeautifulSoup4 for scraping. The problem is that sometimes 'result' is empty and sometimes it isn't. I don't understand why it only works intermittently. Is it a security feature of the website, or a RAM problem? I have no idea.
page_source = BeautifulSoup(driver.page_source, "html.parser")
result = page_source.find_all('div', {'class': 'pv-profile-section-pager ember-view'})

Your class name may not match exactly; you can try a partial match:
result = page_source.find_all('div', {'class': lambda x: x and 'pv-profile-section-pager' in x})
An iframe HTML tag can also be the problem here; if the content lives inside an iframe, you have to switch into it first:
Select iframe using Python + Selenium
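A minimal sketch of what that switch looks like, assuming the content sits inside a single iframe (the selector here is a placeholder; adjust it to your page):
from selenium.webdriver.common.by import By

# switch into the frame that wraps the content before reading page_source
iframe = driver.find_element(By.CSS_SELECTOR, "iframe")  # placeholder selector
driver.switch_to.frame(iframe)
page_source = BeautifulSoup(driver.page_source, "html.parser")
result = page_source.find_all('div', {'class': 'pv-profile-section-pager ember-view'})
driver.switch_to.default_content()  # switch back out when done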

I would suggest adding some delay, since there's no error as per the OP.
Put in a time.sleep(5).
Or, if you want to do it properly using Selenium, I would suggest you have a look at explicit waits in the Selenium Python bindings.
Python - Selenium - ExplicitWait
Sample code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # wait up to 10 seconds for the element to be present in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

Related

Selenium (Python)- Webscraping verb-conjugation tables (Accessing web elements underneath '#document')

Section 0: Introduction:
This is my first web-scraping project and I am not experienced in using Selenium. I am trying to scrape Arabic verb-conjugation tables from this website:
Online Sarf Generator
Any help with the following problem would be great.
Thank you.
Section 1: The Problem:
I am trying to webscrape from the following website:
Online Sarf Generator
For doing this, I am trying to use Selenium.
I basically need to select the three root letters and the family from the four toggle menus as shown in the picture below:
After this, I have to click the 'Generate Sarf Table' button.
Section 2: My Attempt:
Here is my code:
# ------------------ Just setting up the webdriver:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

s = Service('/usr/local/bin/chromedriver')
# Set some selenium chrome options:
chromeOptions = Options()
# chromeOptions.headless = False
driver = webdriver.Chrome(service=s, options=chromeOptions)
driver.get('https://sites.google.com/view/sarfgenerator/home')
# I switch the frame once:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
# I switch the frame again:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
This takes me to the frame within which the webelements that I need are located.
Now, I print the html to see where I am at:
print(BeautifulSoup(driver.execute_script("return document.body.innerHTML;"),'html.parser'))
Here is the output that I get:
<iframe frameborder="0" id="userHtmlFrame" scrolling="yes">
</iframe>
<script>function loadGapi(){var loaderScript=document.createElement('script');loaderScript.setAttribute('src','https://apis.google.com/js/api.js?checkCookie=1');loaderScript.onload=function(){this.onload=function(){};loadGapiClient();};loaderScript.onreadystatechange=function(){if(this.readyState==='complete'){this.onload();}};(document.head||document.body||document.documentElement).appendChild(loaderScript);}function updateUserHtmlFrame(userHtml,enableInteraction,forceIosScrolling){var frame=document.getElementById('userHtmlFrame');if(enableInteraction){if(forceIosScrolling){var iframeParent=frame.parentElement;iframeParent.classList.add('forceIosScrolling');}else{frame.style.overflow='auto';}}else{frame.setAttribute('scrolling','no');frame.style.pointerEvents='none';}clearCookies();clearStorage();frame.contentWindow.document.open();frame.contentWindow.document.write('<base target="_blank">'+userHtml);frame.contentWindow.document.close();}function onGapiInitialized(){gapi.rpc.call('..','innerFrameGapiInitialized');gapi.rpc.register('updateUserHtmlFrame',updateUserHtmlFrame);}function loadGapiClient(){gapi.load('gapi.rpc',onGapiInitialized);}if(document.readyState=='complete'){loadGapi();}else{self.addEventListener('load',loadGapi);}function clearCookies(){var cookies=document.cookie.split(";");for(var i=0;i<cookies.length;i++){var cookie=cookies[i];var equalPosition=cookie.indexOf("=");var name=equalPosition>-1?cookie.substr(0,equalPosition):cookie;document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:00 GMT";document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:01 GMT ;domain=.googleusercontent.com";}}function clearStorage(){try{localStorage.clear();sessionStorage.clear();}catch(e){}}</script>
However, the actual html on the website looks like this:
Section 3: The main problem with my approach:
I am unable to access anything inside the #document contained within the iframe.
Section 4: Conclusion:
Is there a possible solution that can fix my current approach to the problem?
Is there any other way to solve the problem described in Section 1?
You put a lot of effort into structuring your question, so I couldn't not answer it, even if it meant double negation.
Here is how you can drill down into the iframe with content:
EDIT: here is how you can select some options, click the button and access the results:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument("--window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://sites.google.com/view/sarfgenerator/home'
driver.get(url)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@aria-label="Custom embed"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="innerFrame"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="userHtmlFrame"]')))
first_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root1"]'))))
second_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root2"]'))))
third_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root3"]'))))
first_select.select_by_visible_text("ج")
second_select.select_by_visible_text("ت")
third_select.select_by_visible_text("ص")
wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@onclick="sarfGenerator(false)"]'))).click()
print('clicked')
result = wait.until(EC.presence_of_element_located((By.XPATH, '//p[@id="demo"]')))
print(result.text)
Result printed in terminal:
clicked
جَتَّصَ يُجَتِّصُ تَجتِيصًا مُجَتِّصٌ
جُتِّصَ يُجَتَّصُ تَجتِيصًا مُجَتَّصٌ
جَتِّصْ لا تُجَتِّصْ مُجَتَّصٌ Highlight Root Letters
The Selenium setup is for Linux; just mind the imports and the part after the driver is defined.
Selenium documentation can be found here.

selenium get abs url from href attribute

When I'm downloading a page with Selenium and processing it with Java jsoup, I get the hrefs in the source code like this:
Technical Trading
Is there a way to get the absolute URL from this, or to force Selenium to transform it into an absolute URL? Updating the links after getting the page doesn't sound like a clean solution.
If you get the href directly with Selenium, this works as expected:
yourElement.get_attribute('href')
This is a quick sample:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # note this is my webdriver
driver.implicitly_wait(10)
url = "https://www.duckduckgo.co.uk"
driver.get(url)
aList = driver.find_elements(By.TAG_NAME, 'a')
for a in aList:
    print(a.get_attribute('href'))
Output contains:
https://duckduckgo.com/spread
https://duckduckgo.com/spread
https://duckduckgo.com/app
https://duckduckgo.com/app
https://duckduckgo.com/newsletter
https://duckduckgo.com/newsletter
This is how the DOM looks: the href attribute is relative, but get_attribute('href') returns the full path.
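If you end up with a relative href anyway (for example when parsing the saved source outside the browser, as with jsoup), you can resolve it against the page URL yourself; a minimal Python sketch using urllib.parse.urljoin, with illustrative values:
from urllib.parse import urljoin

base_url = "https://www.duckduckgo.co.uk"  # the page the href came from
relative_href = "/app"                     # as found in the raw source
print(urljoin(base_url, relative_href))    # https://www.duckduckgo.co.uk/app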

Python move_to_element().click() is not pressing the right element visible on the screen, or returns an error. Trial code is included

I am trying to interact with elements (a button in this scenario) inside the Disqus iframe on this webpage:
This is my trial code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
path_to_chromedriver = r"c:\users\tv21\source\repos\chromedriver.exe"
driver = webdriver.Chrome(executable_path=path_to_chromedriver)
driver.maximize_window()
url = "https://www.postoj.sk/91472/po-navsteve-kina-si-precitajte-aj-kniznu-predlohu"
driver.get(url)
time.sleep(5)
button_to_close = driver.execute_script("return document.querySelector('body').querySelector('div.grv-dialog-host').shadowRoot.querySelector('div').querySelector('div.buttons-wrapper').querySelector('button.sub-dialog-btn.block_btn')")
ac = ActionChains(driver)
ac.move_to_element(button_to_close).click().perform()
open_discussion = driver.find_element_by_class_name('article-disqus-wrapper')
driver.execute_script("arguments[0].setAttribute('style','display: block;')", open_discussion)
disqus_thread = driver.find_element_by_id("disqus_thread")
iframe_element = disqus_thread.find_element_by_tag_name("iframe")
driver.switch_to.frame(iframe_element)
time.sleep(1)
button_to_load_more = driver.find_element_by_partial_link_text("Nahraj viac komentárov")
ac = ActionChains(driver)
ac.move_to_element(button_to_load_more).click().perform()
The issue is the last command:
ac.move_to_element(button_to_load_more).click().perform()
which shows an error: "move target out of bounds"
I tried instead:
button_to_load_more.click()
and
driver.execute_script("arguments[0].click();", button_to_load_more)
which both work completely fine as the alternatives and I can click the button.
However, I am trying to understand the reason for the target being out of bounds when using move_to_element(). I always get exactly the same error when I want to hover over any element inside the Disqus iframe, too.
Can anyone help me to fix it or explain to me how to fix it?
The first one didn't work because of a known issue in Selenium; I guess you are using 3.4 and hence facing this. (It should work after upgrading to a newer version of Selenium.)
Some useful links for your reference:
Selenium MoveTargetOutOfBoundsException even after scrolling to element
https://github.com/SeleniumHQ/selenium/issues/4148
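As a workaround until you upgrade, scrolling the element into view before the move usually avoids the out-of-bounds error; a minimal sketch, assuming button_to_load_more was located as in your code:
from selenium.webdriver.common.action_chains import ActionChains

# bring the button into the viewport first, then move to it and click
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button_to_load_more)
ActionChains(driver).move_to_element(button_to_load_more).click().perform()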

Trying to find the correct xpath

I have written some code, see the following lines:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.flashscore.com/match/jBvNMej6/#match-summary")
print(driver.title)
driver.maximize_window() # For maximizing window
driver.implicitly_wait(10) # gives an implicit wait of 10 seconds
driver.find_element_by_id('onetrust-reject-all-handler').click()
time.sleep(2)
driver.find_element(By.CLASS_NAME,'previewShowMore.showMore').click()
main = driver.find_element(By.CLASS_NAME,'previewLine'[b[text()="Hot stat:"]]/text)
print(main.text)
time.sleep(2)
driver.close()
However, I get the following error.
main = driver.find_element(By.CLASS_NAME,'previewLine'[b[text()="Hot stat:"]]/text)
^
SyntaxError: invalid syntax
What can I do to avoid this?
thx! : )
Well, in this line
main = driver.find_element(By.CLASS_NAME,'previewLine'[b[text()="Hot stat:"]]/text)
You have made a great mix :)
Your locator is absolutely invalid.
Also, if you want to print the paragraph text without the "Hot streak" prefix, you will need to remove that string from the entire div (paragraph) text.
This should do what you are trying to achieve:
main = driver.find_element(By.XPATH, "//div[@class='previewLine' and ./b[text()='Hot streak']]").text
main = main.replace('Hot streak','')
print(main)
I'm not finding any text 'Hot stat:'. You'll have to attach the html code where you found that.
I assume that you want to retrieve the text of a specific previewLine?
main = driver.find_element(By.XPATH, '//div[@class="previewLine"]/b[contains(text(),"Hot streak")]/..')
print(main.text)

Selenium - Unable to find element with xpath

I am trying to find an element on this page, specifically the bid price in the first row: 196.20p.
I am using selenium and this is my code:
from selenium import webdriver
driver = webdriver.PhantomJS()
address = 'https://www.trustnet.com/factsheets/o/g6ia/ishares-global-property-securities-equity-index-uk'
xpath = '//*[@id="factsheet-tabs"]/fund-tabs/div/div/fund-tab[3]/div/unit-details/div/div/unit-information/div/table[2]/tbody/tr[3]/td[2]'
driver.get(address)
price = driver.find_element_by_xpath(xpath)
print(price.text)
driver.close()
When executed, I receive the following error:
NoSuchElementException: Message: {"errorMessage":"Unable to find element with xpath '//*[#id=\"factsheet-tabs\"]/fund-tabs/div/div/fund-tab[3]/div/unit-details/div/div/unit-information/div/table[2]/tbody/tr[3]/td[2]'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"214","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:62727","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"xpath\", \"sessionId\": \"8faaff70-af12-11e7-a17c-416247c75eb6\", \"value\": \"//*[#id=\\\"factsheet-tabs\\\"]/fund-tabs/div/div/fund-tab[3]/div/unit-details/div/div/unit-information/div/table[2]/tbody/tr[3]/td[2]\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/8faaff70-af12-11e7-a17c-416247c75eb6/element"}}
Screenshot: available via screen
I have used the same approach, but with a different xpath, on Yahoo Finance and it works fine; unfortunately, the price I am looking for is not available there.
If I understood your requirement correctly, this is the price you wanted to scrape. I used a css selector here.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.trustnet.com/factsheets/o/g6ia/ishares-global-property-securities-equity-index-uk')
price = driver.find_element_by_css_selector('[ng-if^="$ctrl.priceInformation.Mid"] td:nth-child(2)').text
print(price.split(" ")[0])
driver.quit()
Result:
196.20p/196.60p
If you want to stick to xpath, then try this:
price = driver.find_element_by_xpath('//*[contains(#ng-if,"$ctrl.priceInformation.Mid")]//td[2]').text
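Since the page renders this table dynamically, it may also help to wrap the lookup in an explicit wait; a sketch of the same css-selector approach with a wait, assuming a Selenium version with By-style locators:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.trustnet.com/factsheets/o/g6ia/ishares-global-property-securities-equity-index-uk')
# wait until the dynamically rendered price cell is present before reading it
price = WebDriverWait(driver, 20).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '[ng-if^="$ctrl.priceInformation.Mid"] td:nth-child(2)')
)).text
print(price.split(" ")[0])
driver.quit()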