Background:
I am trying to scrape information from a link, but I cannot seem to get the HTML source code in order to parse it further.
Link:
https://www.realestate.com.au/buy/property-house-in-vaucluse,+nsw+2030/list-1?source=refinement
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

chrome_options = webdriver.ChromeOptions()
preferences = {"safebrowsing.enabled": "false"}
chrome_options.add_experimental_option("prefs", preferences)
chrome_options.add_argument('--disable-gpu')
browser = webdriver.Chrome('link_to_chrome_driver.exe', chrome_options=chrome_options)
url = property_link
print(url)
browser.get(url)
delay = 20 # seconds
try:
    WebDriverWait(browser, delay).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'rui-button-brand pagination__link-next')))
    time.sleep(10)
except:
    pass
html = browser.page_source
soup = BeautifulSoup(html)
print(soup)
Output:
<html lang="en"><head>
<meta charset="utf-8"/>
<link href="about:blank" rel="shortcut icon"/>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/j.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/f.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=d33b4707-4c3a-5fbb-8de6-b6889ed26c7d"></script><div></div>
</body></html>
Question:
I don't understand what is going wrong. When I load the site manually in any browser, the HTML source is significantly different from what Selenium returns. Parsing the site with Selenium/BeautifulSoup has been far too problematic. What am I doing wrong?
Your CSS selector is incorrect: the two class names belong to the same element, so they must each be prefixed with a dot and chained together, not separated by a space (a space makes it a descendant combinator). Try editing the CSS selector as below:
.rui-button-brand.pagination__link-next
Refer to: https://www.w3schools.com/cssref/css_selectors.asp
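As a quick sanity check of the selector syntax, here is a minimal sketch using BeautifulSoup (which the question already uses); the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# A made-up element carrying both classes, like the real "next page" button.
html = '<a class="rui-button-brand pagination__link-next" href="/list-2">Next</a>'
soup = BeautifulSoup(html, 'html.parser')

# 'rui-button-brand pagination__link-next' (with a space) would look for a
# pagination__link-next element *inside* a rui-button-brand element.
# Chaining the classes with dots matches one element carrying both classes:
matches = soup.select('.rui-button-brand.pagination__link-next')
print(len(matches))  # 1
```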
This question already has answers here:
Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot
(3 answers)
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
(15 answers)
Closed 2 years ago.
This website is able to tell the difference between a real Chrome browser and ChromeDriver. Does anybody know what the difference between a real Chrome browser and ChromeDriver is? Thanks.
https://www.impactaging.com/full/11/908
$ cat chrdvrget.py
#!/usr/bin/env python3
import sys
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome('chromedriver', options=options)
browser.get(sys.argv[1])
sys.stdout.write(browser.page_source)
browser.close()
$ ./chrdvrget.py https://www.impactaging.com/full/11/908
<html><head>
<script src="https://ajax.googleapis.com/ajax/libs/webfont/1/webfont.js" type="text/javascript" async=""></script><script id="meteor-headers" type="application/ejson">{"token":1590854299420.4485,"headers":{"x-forwarded-for":"128.194.2.41","x-forwarded-proto":"https","x-forwarded-port":"443","host":"www.aging-us.com","x-amzn-trace-id":"Root=1-5ed2829b-12b85ab4e6b408f839aca21c","upgrade-insecure-requests":"1","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9","sec-fetch-site":"none","sec-fetch-mode":"navigate","sec-fetch-user":"?1","sec-fetch-dest":"document","accept-encoding":"gzip, deflate, br","accept-language":"en-US","x-ip-chain":"128.194.2.41,172.16.3.155"}}</script>
<link rel="stylesheet" type="text/css" class="__meteor-css__" href="/23e8c653e8c598c40de2bfed84e64681cf9fe6b7.css?meteor_css_resource=true">
<script id="irga-analytics" async="" src="//www.google-analytics.com/analytics.js"></script>
<meta name="fragment" content="!">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1.0">
<meta name="google-site-verification" content="d2_-UPxLNh2h2_LXNOVTluwzz1X0G8w1o7NcXwNDWjY">
<meta name="p:domain_verify" content="e99d9967df904cd1dd4e4063bf796a0a">
<meta name="p:domain_verify" content="6022de5b1b2e4515847cbcbc8f4fc3ad">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.0.6/css/all.css">
<meta name="fragment" content="!">
<title>Aging</title>
<script>
window.prerenderReady = false;
</script>
<script id="altmetric-embed-js" src="https://d1bxh8uas1mnw7.cloudfront.net/assets/altmetric_badges-75bc9437b4bcd96622a3f013e4e9519d1b65ea847ab601ad6158cf84b9291df9.js"></script></head>
<body>
<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.8.1%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%22journal%22%3A%7B%22issn%22%3A%221945-4589%22%2C%22archive_description%22%3A%22Aging%20US%20has%20been%20publishing%20since%202009%20and%20has%20amassed%20vol_count%20volumes%20as%20of%20current_yr.%22%2C%22home_title%22%3A%22Revolutionizing%20gerontology%20by%20abolishing%20dogmas%22%2C%22logo%22%3A%7B%22banner%22%3A%22%2Fimages%2Faging_logo.png%22%2C%22sharing%22%3A%22%2Fimages%2Faging-logo-blue.png%22%2C%22shareMeta%22%3A%22%2Fimages%2Faging-meta-logo.png%22%7D%2C%22name%22%3A%22Aging%22%2C%22nameExtra%22%3Anull%2C%22site%22%3A%7B%22spec%22%3A%7B%22color%22%3A%7B%22main_hex%22%3A%220a588f%22%2C%22main_rgb%22%3A%2255%2C%2071%2C%2079%22%7D%7D%7D%2C%22submissionsLink%22%3A%22http%3A%2F%2Faging.msubmit.net%2F%22%2C%22altmetric%22%3A%7B%22reportLink%22%3A%22https%3A%2F%2Faging.altmetric.com%2Fdetails%2F%22%2C%22threshold%22%3A10%2C%22template%22%3A%22aging%22%7D%2C%22reprintEmail%22%3A%22printing%40oncotarget.com%22%2C%22siteUrl%22%3A%22https%3A%2F%2Fwww.aging-us.com%22%2C%22sitemapHostUrl%22%3A%22http%3A%2F%2Flocalhost%3A3031%22%7D%2C%22ga%22%3A%7B%22id%22%3A%22UA-74807910-2%22%2C%22trackUserId%22%3Atrue%7D%2C%22s3%22%3A%7B%22bucket%22%3A%22paperchase-aging%22%7D%7D%2C%22ROOT_URL%22%3A%22http%3A%2F%2Faging-cyan.papercha.se%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22autoupdate%22%3A%7B%22versions%22%3A%7B%22web.browser%22%3A%7B%22version%22%3A%22b6de6109e579c8788504642644e2aaa8e4fbe19e%22%2C%22versionRefreshable%22%3A%2254bc5b3a9be8ab8b81d73cf07c7a383577471433%22%2C%22versionNonRefreshable%22%3A%22a297acb1cb2103faf27cadad76f96cec9e061066%22%7D%2C%22web.browser.legacy%22%3A%7B%22version%22%3A%22c6651435e54c4057af380b2d941b454c975bd199%22%2C%22versionRefreshable%22%3A%2254bc5b3a9be8ab8b81d73cf07c7a3835774714
33%22%2C%22versionNonRefreshable%22%3A%22a4fb4e266d67d4f9368deaa186db7760aff155bb%22%7D%7D%2C%22autoupdateVersion%22%3Anull%2C%22autoupdateVersionRefreshable%22%3Anull%2C%22autoupdateVersionCordova%22%3Anull%2C%22appId%22%3A%221w0aki1inxkymkvvdn6%22%7D%2C%22appId%22%3A%221w0aki1inxkymkvvdn6%22%2C%22isModern%22%3Atrue%7D"))</script>
<script type="text/javascript" src="/4926fb393a332fa3481bd3a225f0ee7d42684908.js?meteor_js_resource=true"></script>
<div class="hiddendiv common"></div></body></html>
P.S. These links don't provide an answer that explains how to distinguish the difference, and the answers there contradict one another, so they should not be considered as answering my question.
Please provide a working example, using the URL at the beginning, so that the Python code can download the same page content as a real browser.
Please help me with using BeautifulSoup to scrape financial values from investing.com with Python 3.
Whatever I do, I never get any value, and the class I am filtering on keeps changing, because it is a live value.
import requests
from bs4 import BeautifulSoup
url = "https://es.investing.com/indices/spain-35-futures"
precio_objetivo = input("Introduce el PRECIO del disparador:")
precio_objetivo = float(precio_objetivo)
print (precio_objetivo)
while True:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    precio_actual = soup.find('span', attrs={'class': 'arial_26 inlineblock pid-8828-last', 'id': 'last_last', 'dir': 'ltr'})
    print(precio_actual)
    break
When I don't apply any filter at soup.find (trying at least to get the whole page), I get this result:
<bound method Tag.find_all of
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>403 You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake. </title>
</head>
<body>
<h1>Error 403 You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.</h1>
<p>You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.</p>
<h3>Guru Meditation:</h3>
<p>XID: 850285196</p>
<hr/>
<p>Varnish cache server</p>
</body>
</html>
It looks like that website detects where the request is coming from, so we need to 'fool' it into thinking we're on a browser.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
r = Request("https://es.investing.com/indices/spain-35-futures", headers={"User-Agent": "Mozilla/5.0"})
c = urlopen(r).read()
soup = BeautifulSoup(c, "html.parser")
print(soup)
The web server detects the python script as a bot and hence blocks it.
By using headers you can prevent it and the following code does it:
import requests
from bs4 import BeautifulSoup
url = "https://es.investing.com/indices/spain-35-futures"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page = requests.get(url, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')
# this soup returns <span class="arial_26 inlineblock pid-8828-last" dir="ltr" id="last_last">9.182,5</span>
# use the get_text() function to extract the text
result = soup.find('span', attrs={'id': 'last_last'}).get_text()
print(result)
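Note that the scraped text comes back in European number formatting ("9.182,5"). To compare it numerically against precio_objetivo, it needs normalizing first; a small helper:

```python
def parse_euro_number(text):
    """Convert a number formatted like '9.182,5' (dot as thousands
    separator, comma as decimal separator) into a float."""
    return float(text.replace('.', '').replace(',', '.'))

print(parse_euro_number('9.182,5'))  # 9182.5
```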
You can also try the Selenium WebDriver. Otherwise you will face this kind of blocking more often as the number of requests grows, and some sites additionally need JavaScript to render their values.
from selenium import webdriver
url = 'https://example.com/'
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options,executable_path='/usr/local/bin/chromedriver')
driver.get(url)
I'm trying to run this simple test in RIDE, but I cannot figure out why it's failing; it doesn't give me any specific details:
Ride Log
command: pybot.bat --argumentfile c:\users\user\appdata\local\temp\RIDEe2en9t.d\argfile.txt --listener C:\Python27\lib\site-packages\robotide\contrib\testrunner\TestRunnerAgent.py:49555:False C:\Python27\Scripts\test\Login\login_suite.robot
========================================================================================================================================================================
Login Suite
========================================================================================================================================================================
login_user | FAIL |
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="stylesheet" type="text/css" href="/assets/displayhelpservlet.css" media="all"/>
<link href="/assets/favicon.ico" rel="icon" type="image/x-icon" />
<script src="/assets/jquery-3.1.1.min.js" type="text/javascript"></script>
<script src="/assets/displayhelpservlet.js" type="text/javascript"></script>
<script type="text/javascript">
var json = Object.freeze('{"consoleLink":"/wd/hub","type":"Standalone","version":"3.11.0","class":"org.openqa.grid.web.servlet.DisplayHelpServlet$DisplayHelpServletConfig"}');
</script>
</head>
<body>
<div id="content">
<div id="help-heading">
<h1><span id="logo"></span></h1>
[ Message content over the limit has been removed. ]
</span>
</p>
<p>
Happy Testing!
</p>
</div>
<div>
<footer id="help-footer">
Selenium is made possible through the efforts of our open source community, contributions from
these people, and our
sponsors.
</footer>
</div>
</div>
</body>
</html>
Selenium server is started (standalone-3.11.0)
Python version 2.7
Environment Path is set Python27/Scripts
Here is the test code:
*** Settings ***
Library    SeleniumLibrary

*** Test Cases ***
login_user
    SeleniumLibrary.Open Browser    Google.com    googlechrome
    Maximize Browser Window
    Title Should Be    Google
The webdriver for Chrome is also set in Scripts folder, but I've tried it with Firefox as well and got the same result.
EDIT:
So I have tried with this code:
*** Settings ***
Library    SeleniumLibrary

*** Test Cases ***
login_user
    SeleniumLibrary.Open Browser    https://google.com    googlechrome
    Maximize Browser Window
    Title Should Be    Google
If you have
Selenium
Robot Framework
RIDE (for running robot files)
Chrome
The only thing you have to do is download chromedriver.
https://chromedriver.storage.googleapis.com/index.html?path=2.38/
After you have downloaded chromedriver, put it in a folder and add that folder to your PATH.
This is how I did it after unzipping chromedriver:
Ubuntu:
sudo mv chromedriver /usr/local/bin/
sudo chown root:root /usr/local/bin/chromedriver
Windows:
put chromedriver.exe into a folder in this example C:\drivers\
Press Windows button on your keyboard and type Edit the system environment variables
Under Advanced tab, Click Environment Variables
Under System Variables, find "Path", select it, and click the Edit button
Click New and add the folder where you put chromedriver.exe,
in this example C:\drivers
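After editing PATH (and reopening any open terminals), you can verify that the driver actually resolves; a quick check using only the Python standard library:

```python
import shutil

def driver_on_path(name='chromedriver'):
    """Return the resolved path of the driver binary, or None if the
    shell would not find it on PATH."""
    return shutil.which(name)

path = driver_on_path()
print(path if path else 'chromedriver not found on PATH')
```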
This is ridiculous. Why does it happen??
HTML source code:
<!DOCTYPE html>
<html>
<head>
<title>WTF</title>
<meta charset="utf-8" />
</head>
<body id="b">
<map name="Map" id="Map">
<area
id="clickhereyoustupidselenium" alt="" title=""
href="javascript:document.getElementById('b').innerHTML = 'adsf'"
shape="poly" coords="51,29,155,25,247,87,156,129,52,132,23,78,84,56,104,35" />
<img usemap="#Map" src="http://placehold.it/350x150" alt="350 x 150 pic">
</map>
</body>
</html>
Selenium test code:
from django.contrib.staticfiles.testing import StaticLiveServerTestCase
from selenium.webdriver.firefox.webdriver import WebDriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import text_to_be_present_in_element
from selenium.webdriver.common.by import By
class SeleniumTest(StaticLiveServerTestCase):

    @classmethod
    def setUpClass(cls):
        super(SeleniumTest, cls).setUpClass()
        cls.selenium = WebDriver()

    @classmethod
    def tearDownClass(cls):
        cls.selenium.quit()
        super(SeleniumTest, cls).tearDownClass()

    def test_wtf(self):
        self.selenium.get('%s%s' % (self.live_server_url, '/'))
        self.selenium.find_element_by_id('clickhereyoustupidselenium').click()
        WebDriverWait(self.selenium, 100).until(text_to_be_present_in_element((By.TAG_NAME, "body"), "adsf"))
        self.assertEqual(self.selenium.find_element_by_tag_name('body').text, 'adsf')
The test passes beautifully.
OK, so now let's replace src="http://placehold.it/350x150" with a different image, let's say this one: src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/POL_location_map.svg/500px-POL_location_map.svg.png":
<!DOCTYPE html>
<html>
<head>
<title>WTF</title>
<meta charset="utf-8" />
</head>
<body id="b">
<map name="Map" id="Map">
<area
id="clickhereyoustupidselenium" alt="" title=""
href="javascript:document.getElementById('b').innerHTML = 'adsf'"
shape="poly" coords="51,29,155,25,247,87,156,129,52,132,23,78,84,56,104,35" />
<img usemap="#Map" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/POL_location_map.svg/500px-POL_location_map.svg.png" alt="350 x 150 pic">
</map>
</body>
</html>
Let's not touch Selenium code not a teeny tiny bit.
Result? Selenium raises: selenium.common.exceptions.TimeoutException
And indeed, the Firefox window that shows up still shows the map of Poland, and not 'adsf'. If I click on this area in the Firefox window that shows up until the timeout of 100 seconds passes then Selenium immediately concludes the test has passed. But it was Selenium that was supposed to click on this element!!
What is happening and how to stop this madness?
Geckodriver 0.18.0. Selenium 3.5.0. Firefox 55.0.2. Python 3.5.2. And, if this matters, the dev server is Django 1.11.4.
The root cause is that the size GeckoDriver reports for the <area> element is incorrect. Selenium WebDriver tries to click at the middle of the element, but the reported size of the area equals that of the whole map, so Selenium clicks at the wrong position.
You can calculate the position yourself and force Selenium to click there. See the code below.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

area = driver.find_element_by_id('clickhereyoustupidselenium')
# coords alternates x,y pairs ("x1,y1,x2,y2,..."), so split them by index.
# (Filtering by value parity would mix the axes up, and in Python 3
# filter() returns a one-shot iterator that max() would exhaust before min().)
coordsNumbers = [int(p) for p in area.get_attribute("coords").split(',')]
xs = coordsNumbers[0::2]
ys = coordsNumbers[1::2]
# Offset from the element's top-left corner to the centre of the
# polygon's bounding box:
middleX = (max(xs) + min(xs)) // 2
middleY = (max(ys) + min(ys)) // 2
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(area, middleX, middleY)
action.click()
action.perform()
WebDriverWait(driver, 100).until(EC.text_to_be_present_in_element((By.TAG_NAME, "body"), "adsf"))
print("Message found")
driver.quit()
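The coordinate arithmetic can be pulled out into a small, self-contained helper so it can be unit-tested without a browser (the coords attribute alternates x,y values):

```python
def area_click_offset(coords_attr):
    """Given an <area> element's coords attribute ("x1,y1,x2,y2,..."),
    return the (x, y) centre of the polygon's bounding box, i.e. the
    offset from the element's top-left corner at which to click."""
    nums = [int(p) for p in coords_attr.split(',')]
    xs, ys = nums[0::2], nums[1::2]
    return (max(xs) + min(xs)) // 2, (max(ys) + min(ys)) // 2

# The polygon from the question:
print(area_click_offset('51,29,155,25,247,87,156,129,52,132,23,78,84,56,104,35'))
# → (135, 78)
```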
I'm running an HTML test suite as such:
java -jar /var/lib/selenium/selenium-server.jar -browserSessionReuse -htmlSuite *firefox http://$HOST ./test/selenium/html/TestSuite.html ./target/selenium/html/TestSuiteResults.html
Is there a way I can run all test suites in a directory, or create a test suite of test suites?
I'm very new to Selenium and really only Selenium2. I'm using "TestNG" as my test framework, and it does support suites and suites of suites using an xml file that specifies which tests carrying a particular annotation are part of the suite.
If running suites of suites is what you are looking for, and you are using Java exclusively (TestNG does not, as I understand it, support anything other than Java), then you may find what you're looking for.
I created a Grails script to generate a super-testsuite automatically. Needing to modify a test suite is one more step when adding a test, and each extra barrier increases the likelihood that developers will refuse to write tests.
import groovy.io.FileType

includeTargets << grailsScript("Init")

target(main: "Auto-generates the TestSuite.html file needed for selenium based on selenium html tests in test/selenium/html/**") {
    File testSuiteOutputFile = new File("test/selenium/html/TestSuite.html")
    testSuiteOutputFile.delete()
    String testRows = buildTestRows()
    testSuiteOutputFile <<
"""
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type" />
<title>Test Suite</title>
</head>
<body>
<table id="suiteTable" cellpadding="1" cellspacing="1" border="1" class="selenium"><tbody>
<tr><td><b>Test Suite</b></td></tr>
$testRows
</tbody></table>
</body>
</html>
"""
}

private def buildTestRows() {
    String testRows = ""
    List<File> testFiles = getAllTestFilesInSeleniumDirectory()
    testFiles.each { file ->
        def relativePath = buildFilePathRelativeToTestSuite(file)
        println "Adding $relativePath to TestSuite"
        testRows += "<tr><td><a href='${relativePath}'>${file.name}</a></td></tr>"
        testRows += "<tr><td><a href='clearCache.html'>Clear Cache</a></td></tr>"
    }
    testRows
}

private List<File> getAllTestFilesInSeleniumDirectory() {
    File testsDirectory = new File("test/selenium/html")
    def files = []
    testsDirectory.eachFileRecurse(FileType.FILES) { files << it }
    files
}

private String buildFilePathRelativeToTestSuite(File file) {
    File parentDirectory = new File("test/selenium/html")
    String relativePath = file.name
    file = file.parentFile
    while (file != parentDirectory) {
        relativePath = file.name + "/" + relativePath
        file = file.parentFile
    }
    relativePath
}

setDefaultTarget(main)
Look at Selunit. It provides a Maven plugin to execute Selenese suites in batch and transform the reports to JUnit format. The latter is very useful for integrating test execution into a CI server like Jenkins, which generates nice charts and sends notifications in case of test errors.