I'm trying to print the search results of DuckDuckGo using a headless WebDriver and Selenium. However, I cannot locate the DOM elements for the search results, no matter what ID or class name I search for and no matter how long I wait for the page to load.
Here's the code:
import time
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

opts = Options()
opts.headless = False
browser = Firefox(options=opts)
browser.get('https://duckduckgo.com')
search = browser.find_element_by_id('search_form_input_homepage')
search.send_keys("testing")
search.submit()
# wait for the URL to change, with a 15-second timeout
WebDriverWait(browser, 15).until(EC.url_changes(browser.current_url))
print(browser.current_url)
results = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "links")))
time.sleep(10)
results = browser.find_elements_by_class_name('result results_links_deep highlight_d result--url-above-snippet')  # I tried many other IDs and class names
print(results)  # prints []
I'm starting to suspect there is some trickery in DuckDuckGo to prevent web scraping. Does anyone have a clue?
I changed to using a CSS selector, and then it worked. I use Java, not Python.
List<WebElement> elements = driver.findElements(
        By.cssSelector(".result.results_links_deep.highlight_d.result--url-above-snippet"));
System.out.println(elements.size());
// 10
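For anyone who wants to stay in Python, a rough equivalent of the same CSS-selector approach (an untested sketch; the class names come from the Java answer above and may have changed on the live site):
# Sketch of the equivalent Python call (untested; the DuckDuckGo result
# classes are taken from the Java answer and may change over time).
results = browser.find_elements_by_css_selector(
    ".result.results_links_deep.highlight_d.result--url-above-snippet")
print(len(results))  # should match the number of results on the page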
First time using selenium for web scraping a website, and I'm fairly new to python. I have tried to scrape a Swedish housing site to extract price, address, area, size, etc., for every listing for a specific URL that shows all houses for sale in a specific area called "Lidingö".
I managed to bypass the pop-up window for accepting cookies.
However, the output I get from the terminal is blank when the script runs. I get nothing, not an error, not any output.
What could possibly be wrong?
The code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
s = Service("/Users/brustabl1/hemnet/chromedriver")
url = "https://www.hemnet.se/bostader?location_ids%5B%5D=17846&item_types%5B%5D=villa"
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get(url)
# The cookie button clicker
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[62]/div/div/div/div/div/div[2]/div[2]/div[2]/button"))).click()
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
There are a lot of flaws in your code...
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
In your code, this returns a list of elements in the variable lists.
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
You are not storing the value of each address in a list; instead, you are overwriting it on every iteration. And each of those absolute XPaths refers to one exact element, so your loop selects the same elements over and over again! The locators inside the loop need to be relative to each card (starting with .//), as sketched below.
Also, scraping text through Selenium is bad practice; use BeautifulSoup for the parsing instead.
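A corrected sketch of the loop (untested, and the relative locator is an illustrative guess, since Hemnet's exact markup may differ):
# Corrected sketch: locate every card (li instead of li[1]), search relative
# to each card with ".//", and append to a list instead of overwriting.
# The ".//h2" locator is an illustrative guess at where the address lives.
adresses = []
cards = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li/a/div[2]')
for card in cards:
    adress = card.find_element(By.XPATH, './/h2')
    adresses.append(adress.text)
print(adresses)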
I'm trying to scrape Google results using Selenium and ChromeDriver. Before, I used requests + BeautifulSoup to scrape Google results, and this worked; however, I got blocked by Google after around 300 results. I've been reading into this topic, and it seems that Selenium + WebDriver is less easily blocked by Google.
Now, I'm trying to scrape Google results using Selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do this: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver). But when I run that code, I get:
NoSuchElementException: no such element: Unable to locate element:
{"method":"css selector","selector":"h3"} (Session info:
chrome=90.0.4430.212)
Therefore, I'm trying different code. This code is able to scrape some, but not ALL, of the titles and descriptions. See the picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues on this? Much appreciated.
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))
link_tag = './/div[@class="yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[@class="aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)
for t in titles:
    print('title:', t.text)
for l in links:
    print('links:', l.get_attribute("href"))
for d in descriptions:
    print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
Image of the results:
That's because those 4 are not actual links; Google always shows a "People also ask" section. If you look at their DOM structure:
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd"
     data-kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE"
     data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
    How do I get Google to show all results?
</div>
it is not an anchor tag, so there is no href attribute, and your links list will have 4 empty values because there are 4 divs like that.
To grab those 4 you need to use a different locator:
XPath: //*[local-name()='svg']/../following-sibling::div[@style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[@style]")
for title in title_tags:
    print(title.text)
I am new to programming and extremely new to web scraping. I need to scrape a table from a web page, where the table is displayed after a video.
As I said in the title, I tried implicit waits like:
driver.implicitly_wait(40)
...
inputElement = driver.find_element_by_class_name("_td")
(I tried the same with the xpath)
and explicit waits:
wait = WebDriverWait(driver, 40)
wait.until(EC.visibility_of_element_located((By.XPATH, "path...")))
(tried the same with class name)
This is what I get:
NoSuchElementException: Message: no such element: Unable to locate element
I would REALLY appreciate your help!
There is an iframe we need to switch to, so do this prior to looking for your elements:
iframe = driver.find_element_by_tag_name('iframe')
driver.switch_to.frame(iframe)
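Putting it together, a minimal sketch (assuming the _td class name from the question, which may not match the real page; switch back out of the frame when you are done):
# Minimal sketch: switch into the iframe first, then look for the table cell.
# "_td" is the class name from the question and may not match the real page.
iframe = driver.find_element_by_tag_name('iframe')
driver.switch_to.frame(iframe)
inputElement = driver.find_element_by_class_name("_td")
print(inputElement.text)
driver.switch_to.default_content()  # return to the main document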
So, very new here to Selenium, but I'm having trouble selecting the element I want from this website. In this case, I got the XPath using Chrome's 'Copy XPath' tool. Basically, I'm looking to extract the CID text (in this case 4004) from the website, but my code seems unable to do this. Any help would be appreciated!
I have also tried using the CSS selector method as well but it returns the same error.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
driver = webdriver.Chrome()
chem_name = "D008294"
url = "https://pubchem.ncbi.nlm.nih.gov/#query=" + chem_name
driver.get(url)
elements = driver.find_elements_by_xpath('//*[@id="collection-results-container"]/div/div/div[2]/ul/li/div/div/div/div[2]/div[2]/div[2]/span/a/span/span')
driver.close()
print(elements.text)
As of now, this is the error I receive: 'list' object has no attribute 'text'
Here is the xpath that you can use.
//span[.='Compound CID']//following-sibling::a/descendant::span[2]
Why your script did not work: I see 2 issues in your code.
elements = driver.find_elements_by_xpath('//*[@id="collection-results-container"]/div/div/div[2]/ul/li/div/div/div/div[2]/div[2]/div[2]/span/a/span/span')
driver.close() # <== don't close the browser until you are done with all your steps on the browser or elements
print(elements.text) # <== you cannot get text from a list (Python will throw an error here)
How to fix it:
CID = driver.find_element_by_xpath("//span[.='Compound CID']//following-sibling::a/descendant::span[2]").text # <== returning the text using find_element (not find_elements)
driver.close()
print(CID) # <== now you can print `CID` even though the browser is closed, as the value is already stored in the variable
The function driver.find_elements_by_xpath returns a list of elements. You should loop to get the text of each element, like this:
for ele in elements:
    print(ele.text)
Or, if you only want the first matching element, use the driver.find_element_by_xpath function instead.
The XPath that Chrome's copy tool provides does not always work as expected. First you have to learn how to write XPath yourself and verify it in the Chrome console.
See these links, which will help you learn about XPaths:
https://www.guru99.com/xpath-selenium.html
https://www.w3schools.com/xml/xpath_syntax.asp
In this case, first find the span containing the text Compound CID, then move to the parent span and down to the child a/span/span. Something like //span[contains(text(),'Compound CID')]/parent::span/a/span/span, as sketched below.
Also, you need to use find_element, which returns a single element, and get the text from it. If you use find_elements, it returns a list of elements, so you need to loop over them and get the text from each one.
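A quick sketch of that suggestion (untested; PubChem's markup may have changed since this answer was written):
# Sketch of the parent/child traversal described above (untested; the
# XPath may need adjusting if PubChem's markup has changed).
cid = driver.find_element_by_xpath(
    "//span[contains(text(),'Compound CID')]/parent::span/a/span/span").text
print(cid)  # e.g. 4004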
XPath: //a[contains(@href, 'compound')]/span[@class='breakword']/span
You can use the href as your attribute reference, since I noticed that it has a unique value for each compound.
Example:
href="https://pubchem.ncbi.nlm.nih.gov/substance/53790330"
href="https://pubchem.ncbi.nlm.nih.gov/compound/4004"
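For instance, a hedged sketch of that locator in use (the breakword class comes from this answer and may have changed on the live site):
# Sketch using the href-based XPath above: match the result link whose href
# contains "compound" (class names may have changed on the live site).
cid_el = driver.find_element_by_xpath(
    "//a[contains(@href, 'compound')]/span[@class='breakword']/span")
print(cid_el.text)  # e.g. 4004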
So I need to scrape a page like this one, for example, and I am using Scrapy + Selenium to interact with the datepicker calendar, but I am running into ElementNotVisibleException: Message: Element is not currently visible and so may not be interacted with.
So far I have:
def parse(self, response):
    self.driver.get("https://www.airbnb.pt/rooms/9315238")
    try:
        element = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@name='checkin']"))
        )
    finally:
        x = self.driver.find_element_by_xpath("//input[@name='checkin']").click()
        import ipdb; ipdb.set_trace()
    self.driver.quit()
I saw some references on how to achieve this: https://stackoverflow.com/a/25748322/977622 and https://stackoverflow.com/a/19009256/977622.
I would appreciate it if someone could help me with my issue, or even provide a better example of how I can interact with this datepicker calendar.
There are two elements with name="checkin", and the first one that you actually find is invisible. You need to make your locator more specific to match the desired input. I would also use the visibility_of_element_located condition instead:
element = WebDriverWait(self.driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".book-it-panel input[name=checkin]"))
)
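Once the wait succeeds it returns the now-visible input, so you can interact with it directly (a small sketch continuing from the code above):
# The wait returns the visible check-in input, which can now be clicked
# to open the datepicker without the ElementNotVisibleException.
element.click()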