This is my first time using Selenium for web scraping, and I'm fairly new to Python. I'm trying to scrape a Swedish housing site to extract the price, address, area, size, etc. for every listing at a specific URL that shows all houses for sale in an area called "Lidingö".
I managed to bypass the pop-up window for accepting cookies.
However, when the script runs, the terminal output is blank: no error, no output at all.
What could possibly be wrong?
The code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service("/Users/brustabl1/hemnet/chromedriver")
url = "https://www.hemnet.se/bostader?location_ids%5B%5D=17846&item_types%5B%5D=villa"

driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get(url)

# The cookie button clicker
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[62]/div/div/div/div/div/div[2]/div[2]/div[2]/button"))).click()

lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')

for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
There are a lot of flaws in your code...
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
In your code, this returns a list of elements in the variable lists.
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
You are not storing the value of each address in a list; instead, you are overwriting it on every iteration. And since an absolute XPath points at one exact element, your loop is selecting the same element over and over again!
Also, scraping text through Selenium alone is bad practice; use BeautifulSoup instead.
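A minimal sketch of the fix: find every listing card once, then query each card with relative XPaths (note the leading ./) and append to a list. The relative paths are inferred from the absolute XPaths above and are an assumption about Hemnet's actual markup:
from selenium.webdriver.common.by import By

# select every listing card (li, not just li[1])
cards = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li/a/div[2]')

results = []
for card in cards:
    # './' keeps the lookup relative to this card instead of the whole page
    adress = card.find_element(By.XPATH, './div/div[1]/div[1]/h2').text
    price = card.find_element(By.XPATH, './div/div[2]/div[1]/div[1]').text
    results.append({'adress': adress, 'price': price})

for row in results:
    print(row)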
I am learning web scraping with Selenium for a Finance team project. The idea is:
Login to HR system
Search for Purchase Order Number
System display list of attachments
Download the attachments
Below is my code:
# interaction with Google Chrome
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# specify chromedriver location
PATH = './chromedriver_win32/chromedriver.exe'

# open Google Chrome browser & visit Purchase Order page within HRIS
browser = webdriver.Chrome(PATH)
browser.get('https://purchase.sea.com/#/list')

< user input ID & password >

# user interface shows "My Purchase Request" & "My Purchase Order" tabs
# click on the Purchase Order tab
try:
    po_tab = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "My Purchase Orders"))
    )
    po_tab.click()
except:
    print('HTML element not found!!')

# locate PO Number field and key in PO number
try:
    po_input_field = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "input-field"))
    )
    po_input_field.send_keys(<dummy PO num#>)  # any PO number
except:
    print("PO field not found!!")

# locate Search button and click search
try:
    search_button = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Search"))
    )
    search_button.click()
except:
    print("Search button not found!!")
I'm stuck at the step # click on the Purchase Order tab and the following steps.
I can find the elements, but I see an error after executing the .py script. The most interesting part is that I can do it perfectly in Jupyter Notebook.
Python Script Execute Error
Here are the elements from the inspection screenshots:
Purchase Orders tab
PO Number input field
Search button
See, you are using presence_of_element_located, which is basically
""" An expectation for checking that an element is present on the DOM
of a page. This does not necessarily mean that the element is visible.
locator - used to find the element
returns the WebElement once it is located
"""
What I would suggest to you is to use element_to_be_clickable, which is
""" An expectation for checking that an element is present on the DOM of a
page and visible. Visibility means that the element is not only displayed
but also has a height and width that is greater than 0.
locator - used to find the element
returns the WebElement once it is located and visible
"""
So, in your code it'd be something like this:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "My Purchase Orders"))).click()
Also, we could try different locators if this does not work, such as CSS selectors or XPath; a couple of sketches follow. But I'll let you comment first on whether this works or not.
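For reference, a minimal sketch of the same clickable wait using those other locator strategies. Both selectors below are guesses, since the page's actual markup isn't shown in the question:
# hypothetical CSS selector for the tab -- the real markup isn't shown
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "a.po-tab"))
).click()

# or an XPath keyed on the visible link text
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//a[text()='My Purchase Orders']"))
).click()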
The first error in your log is HTML element not found!! You are performing a click before the element is visible in the DOM. Please try the possible solutions below.
Wait till the element is visible, then perform the click operation:
EC.visibility_of_element_located((By.LINK_TEXT, "My Purchase Orders"))
If you are still not able to click with the above code, then wait for the element to be clickable:
EC.element_to_be_clickable((By.LINK_TEXT, "My Purchase Orders"))
I would also suggest creating reusable methods for actions like click(), getText(), etc.; a sketch follows.
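A minimal sketch of such helpers, assuming a plain WebDriver instance; the function names here are my own, not part of Selenium's API:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_and_click(driver, locator, timeout=10):
    # wait until the element is clickable, then click it
    WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable(locator)
    ).click()

def wait_and_get_text(driver, locator, timeout=10):
    # wait until the element is visible, then return its text
    element = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(locator)
    )
    return element.text

# usage:
wait_and_click(browser, (By.LINK_TEXT, "My Purchase Orders"))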
I'm trying to scrape Google results using Selenium ChromeDriver. Previously, I used requests + BeautifulSoup to scrape Google results, and that worked; however, I got blocked by Google after around 300 results. I've been reading up on this topic, and it seems that Selenium + WebDriver is less easily blocked by Google.
Now, I'm trying to scrape Google results using Selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do this: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver). However, when I run the code from that question, I get:
NoSuchElementException: no such element: Unable to locate element:
{"method":"css selector","selector":"h3"}
(Session info: chrome=90.0.4430.212)
Therefore, I'm trying another code. This code is able to scrape some, but not ALL the titles + descriptions. See picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues on this? Much appreciated.
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))
link_tag = './/div[@class="yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[@class="aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)
for t in titles:
    print('title:', t.text)

for l in links:
    print('links:', l.get_attribute("href"))

for d in descriptions:
    print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
Image of the results:
That's because those 4 are not actual links; Google always shows a "People also ask" section. If you look at their DOM structure:
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd" data-
kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE"
data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
How do I get Google to show all results?
</div>
it is not an anchor tag, so there is no href attribute, and your links list will have 4 empty values because there are 4 divs like that.
To grab those 4 you need to use a different locator:
XPATH: //*[local-name()='svg']/../following-sibling::div[@style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[@style]")

for title in title_tags:
    print(title.text)
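Incidentally, one way to avoid mismatched parallel lists for the regular results is to iterate per result container and read each title and link from inside it. A sketch reusing the yuRUbf class from the question (the loop structure itself is my assumption):
# iterate per organic-result container so each title stays paired with its link
result_blocks = driver.find_elements_by_xpath('//div[@class="yuRUbf"]')
for block in result_blocks:
    link = block.find_element_by_xpath('./a').get_attribute('href')
    title = block.find_element_by_xpath('.//h3').text
    print(title, '->', link)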
How do I effectively use elements retrieved from Selenium that are stored in variables? I am using Python. In the program below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

driver = webdriver.Firefox()
driver.get("https://boards.4chan.org/wg/archive")

matching_threads = []
key = "Pixel"

for i in driver.find_elements_by_class_name("teaser-col"):
    if key in i.text:
        matching_threads.append(i)
        matched_thread = i

print(matching_threads)
driver.quit()
I get the following from the printout of matching_threads:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="aa74a4a6-5bb2-4b48-92b6-50f5d51a9e5c", element="59b6076f-a5a2-4862-9c1f-028025e4b567")>]
How can I use that output to select said element in Selenium and interact with it? What I am trying to do is go to that element and then click on the element to the right of it. What I am failing to understand is how to retrieve the element in Selenium using the stored information in matching_threads.
If anyone can help me, I would very much appreciate it.
To click on the a tag with class quotelink in the adjacent td:
i.find_element_by_xpath(".//following::td[1]/a[@class='quotelink']").click()
If clicking navigates to another page, you could instead grab the href values, insert them into a list, and then loop through them with driver.get(), as sketched below. If it opens a new tab, you should be fine.
.get_attribute('href')
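A minimal sketch of that href-collecting approach, reusing the quotelink XPath from above. Collecting plain strings first matters because WebElement references go stale once the driver navigates away:
# collect href strings first: WebElements go stale after driver.get()
links = []
for i in driver.find_elements_by_class_name("teaser-col"):
    if key in i.text:
        link = i.find_element_by_xpath(
            ".//following::td[1]/a[@class='quotelink']"
        ).get_attribute('href')
        links.append(link)

for url in links:
    driver.get(url)
    # ... interact with each thread page here ...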
Greetings all,
I am trying to extract tables from this site https://theunderminejournal.com/#eu/silvermoon/category/battlepets but I am having some difficulties with that. My code, and whatever else I have tried, has failed to bring up any result:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def getbrowser():
    options = Options()
    options.add_argument("--disable-extensions")
    #options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver

def scrape():  # create scrape engine from scratch
    driver = getbrowser()
    start = time.time()
    site1 = "https://theunderminejournal.com/#eu/silvermoon/category/battlepets"
    driver.get(site1)
    time.sleep(10)
    tbody = driver.find_element_by_tag_name("table")
    #cell = tbody.find_elements_by_tag_name("tr").text
    for tr in tbody:
        td = tbody.find_elements_by_tag_name("tr")
        print(td)
    driver.close()

scrape()
My goal is to extract the name and the first price for each pet (from all the tables) and build a table with these two values.
Generally, I am building a scrape bot that will compare prices between two servers.
I know my scraping skills are still low; can you please point me to something I could read or watch to improve?
Thanks again for your time.
Get all the names and prices in two lists and use their values in order; just replace the print commands with whatever you want:
names = driver.find_elements_by_css_selector("[class='name'] a")
prices = driver.find_elements_by_css_selector(":nth-child(4)[class='price'] span")

i = 0
for x in names:
    print(x.text)
    print(prices[i].text)
    i += 1
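As a side note, pairing the two lists with zip avoids the manual counter; a sketch, assuming the selectors above return one price per name:
# zip stops at the shorter list, so a missing price won't raise IndexError
for name, price in zip(names, prices):
    print(name.text)
    print(price.text)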
Hope it helps.
I have written a Python script that aims to take data off a website, but I am unable to navigate and loop through the pages to collect the links. The website is https://www.shearman.com/people. The relevant HTML on the site looks like this:
<ul class="results-pagination">
  <li><a href="" onclick="PageRequest('2', event)">...</a></li>
</ul>
When I run the code below, it says that the element is not attached to the page:
try:
    # this navigates to the next page
    driver.find_element_by_xpath('//ul[@class="results-pagination"]/li/[@onclick=">"]').click()
    time.sleep(5)
except NoSuchElementException:
    break
Any ideas what I'm doing wrong on this?
Many thanks in advance.
Chris
You can try this code:
browser.get("https://www.shearman.com/people")
wait = WebDriverWait(browser, 30)
main_tab = browser.current_window_handle

navigation_buttons = browser.find_elements_by_xpath('//ul[@class="results-pagination"]//descendant::a')
size = len(navigation_buttons)
print('this is the length of the list:', size)

i = 0
while i < size:
    ActionChains(browser).key_down(Keys.CONTROL).click(navigation_buttons[i]).key_up(Keys.CONTROL).perform()
    browser.switch_to_window(main_tab)
    i = i + 1
    if i >= size:
        break
Make sure to import these:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
Note this will open each link in a new tab. As per your requirement, you can click on the next button using this XPath: //ul[@class="results-pagination"]//descendant::a
If you want to open the links one by one in the same tab, then you will have to handle stale element references, since once you move away from the main page, all of its elements become stale; see the sketch below.
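A minimal sketch of that same-tab approach, re-finding the pagination links on every pass so the references stay fresh. The XPath is the one from above, while the loop structure and browser.back() navigation are assumptions:
# re-locate the links each iteration: old references go stale after navigation
xpath = '//ul[@class="results-pagination"]//descendant::a'
count = len(browser.find_elements_by_xpath(xpath))

for index in range(count):
    links = browser.find_elements_by_xpath(xpath)  # fresh lookup every time
    links[index].click()
    # ... scrape the page here ...
    browser.back()  # return to the list before the next lookup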