How to extract a couple of tables from a site using Selenium - selenium

Greetings all,
I am trying to extract tables from this site, https://theunderminejournal.com/#eu/silvermoon/category/battlepets, but I am having some difficulties with it. My code, and everything else I tried, failed to bring up any result:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def getbrowser():
    options = Options()
    options.add_argument("--disable-extensions")
    #options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver

def scrape():  # create scrape engine from scratch
    driver = getbrowser()
    start = time.time()
    site1 = "https://theunderminejournal.com/#eu/silvermoon/category/battlepets"
    driver.get(site1)
    time.sleep(10)
    tbody = driver.find_element_by_tag_name("table")
    #cell = tbody.find_elements_by_tag_name("tr").text
    for tr in tbody:
        td = tbody.find_elements_by_tag_name("tr")
        print(td)
    driver.close()

scrape()
My goal is to extract the name and the first price of each pet (from all the tables) and create a table with these two values.
More generally, I am building a scraping bot that will compare prices between two servers.
I know my scraping skills are still weak; can you please point me to something I could read or watch to improve?
Thanks again for your time.

Get all the names and prices into two lists and use their values in order; just replace the print calls with whatever you want:
names = driver.find_elements_by_css_selector("[class='name'] a")
prices = driver.find_elements_by_css_selector(":nth-child(4)[class='price'] span")
i = 0
for x in names:
    print(x.text)
    print(prices[i].text)
    i += 1
Hope it helps.
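If the goal is an actual two-column table, a small extension of the answer above pairs each name with its first price using zip and writes the rows out as CSV. This is only a sketch: it reuses the CSS selectors from the answer (which are assumptions about the page structure), and the output file name is just an example.
import csv
names = driver.find_elements_by_css_selector("[class='name'] a")
prices = driver.find_elements_by_css_selector(":nth-child(4)[class='price'] span")
# pair each pet name with its first listed price; zip stops at the shorter list
rows = [(n.text, p.text) for n, p in zip(names, prices)]
with open("pets.csv", "w", newline="", encoding="utf-8") as f:  # example output file
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)
Running the same snippet against the second server's URL gives you a second file to compare prices against.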

Related

The output from my Selenium script is blank, how do I fix it?

This is my first time using Selenium for web scraping, and I'm fairly new to Python. I have tried to scrape a Swedish housing site to extract the price, address, area, size, etc. for every listing at a specific URL that shows all houses for sale in an area called "Lidingö".
I managed to bypass the pop-up window for accepting cookies.
However, the output I get in the terminal is blank when the script runs: nothing at all, not even an error.
What could possibly be wrong?
The code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service("/Users/brustabl1/hemnet/chromedriver")
url = "https://www.hemnet.se/bostader?location_ids%5B%5D=17846&item_types%5B%5D=villa"
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get(url)
# The cookie button clicker
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[62]/div/div/div/div/div/div[2]/div[2]/div[2]/button"))).click()
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
There are a lot of flaws in your code...
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
In your code, this returns a list of elements into the variable lists.
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
You are not storing the value of each address in a list; instead, you overwrite it on every iteration. And because each XPath is absolute, it points to one exact element, so your loop keeps selecting the same element over and over again!
Also, scraping text through Selenium is bad practice; use BeautifulSoup instead.
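A minimal sketch of the relative-locator fix described above is shown here: locate each listing card once, then search inside it with XPaths that start with "." so every iteration reads a different card. The card and field locators are placeholders for illustration; Hemnet's real markup uses different (and frequently changing) class names.
from selenium.webdriver.common.by import By
# assumed: one card element per listing; adjust the container XPath to the real page
cards = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li/a/div[2]')
results = []
for card in cards:
    address = card.find_element(By.XPATH, './/h2').text                          # relative to this card
    price = card.find_element(By.XPATH, './/*[contains(@class, "price")]').text  # hypothetical class name
    results.append((address, price))
print(results)
With relative locators, each address and price comes from the card being iterated rather than from the first listing on the page.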

How to use CSS selector with strings

I've been trying to create a program that finds data on a site using Selenium. I've downloaded the WebDriver and imported the By module.
driver.get("https://en.wikipedia.org/wiki/Main_Page")
number = driver.find_element(By.CSS_SELECTOR("articlecount a"))
print(number)
When I run the code, it displays an error saying the str object is not callable. (I'm assuming the error comes from the "articlecount a" part.)
I've tried removing the two quotation marks around "articlecount a", but that just creates more errors.
Does anyone know what I have to do to allow the CSS selector to extract the data?
Instead of this
number = driver.find_element(By.CSS_SELECTOR("articlecount a"))
It should be like
number = driver.find_element(By.CSS_SELECTOR,"articlecount a")
print(number)
There are several problems here:
The command you are trying to use has somewhat different syntax.
Instead of
number = driver.find_element(By.CSS_SELECTOR("articlecount a"))
It should be something like
number = driver.find_element(By.CSS_SELECTOR,"articlecount a")
Your locator is wrong.
Instead of articlecount a it probably should be #articlecount a
You are missing a delay here. The best practice is to use Expected Conditions explicit waits.
You need to extract the text value from the number web element object.
With all the above your code will be:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 20)
driver.get("https://en.wikipedia.org/wiki/Main_Page")
number = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#articlecount a"))).text
print(number)
The output here will be
6,469,178
It should be
number = driver.find_element(By.CSS_SELECTOR,"#articlecount a")
print(number.text)

Failing to scrape the full page from Google Search results using selenium

I'm trying to scrape Google results using Selenium ChromeDriver. Previously, I used requests + BeautifulSoup to scrape Google results, and this worked; however, I got blocked by Google after around 300 results. I've been reading up on this topic, and it seems that Selenium + WebDriver is less easily blocked by Google.
Now I'm trying to scrape the Google results using Selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do what is described in: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver). However, running that code gives me the following error:
NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"h3"} (Session info: chrome=90.0.4430.212)
Therefore, I'm trying different code. This code is able to scrape some, but not ALL, of the titles and descriptions. See the picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues? Much appreciated.
import urllib.parse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?'  # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')

options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)

wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))

link_tag = './/div[@class="yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[@class="aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)

for t in titles:
    print('title:', t.text)
for l in links:
    print('links:', l.get_attribute("href"))
for d in descriptions:
    print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
Image of the results:
That's because those 4 are not actual result links; Google always shows a "People also ask" block. If you look at their DOM structure:
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd" data-kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE" data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
How do I get Google to show all results?
</div>
you can see it is not an anchor tag, so there is no href attribute, and your links list will have 4 empty values because there are 4 divs like that.
To grab those 4 you need to use a different locator:
XPath: //*[local-name()='svg']/../following-sibling::div[@style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[@style]")
for title in title_tags:
    print(title.text)
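Another way to avoid the mismatched lists (and the empty trailing descriptions) is to iterate over each organic result container and read the title, link and description from inside it, so a missing field stays attached to its own result. This is a sketch only: div.g, div.yuRUbf and span.aCOpRe are the selectors used in the question, and Google changes these class names often, so they may need updating.
from selenium.common.exceptions import NoSuchElementException
results = []
for container in driver.find_elements_by_css_selector("div.g"):  # assumed: one organic result per div.g
    try:
        title = container.find_element_by_css_selector("h3").text
        link = container.find_element_by_css_selector("div.yuRUbf > a").get_attribute("href")
    except NoSuchElementException:
        continue  # skip "People also ask" and other non-result blocks
    try:
        description = container.find_element_by_css_selector("span.aCOpRe").text
    except NoSuchElementException:
        description = ""  # some results genuinely have no snippet
    results.append((title, link, description))
for title, link, description in results:
    print(title, link, description, sep=" | ")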

How to scrape all the data from this website link using Selenium and save the extracted city, location and contact number as a CSV dataframe object?

The website URL to scrape data from:
http://jawedhabib.co.in/hairandbeautysalons-sl/
Code:
lst = driver.find_element_by_css_selector(".post-17954.page.type-page.status-publish.hentry").text
for i in lst:
    driver.implicitly_wait(2)
    city = driver.find_element_by_css_selector("tr").text
    salon_address = driver.find_element_by_css_selector("tr").text
    Contact_number = driver.find_element_by_css_selector("tr").text
print(lst)
Here's the first part of your problem. To start with, you need to wait for all the elements to load onto the screen. Then grab all the table trs beyond the first 2, which are reserved for the location header. From each tr, go to its children using ./ and grab the text of td[1] to td[3] with get_attribute('textContent'), respectively.
wait = WebDriverWait(driver, 60)
driver.get("http://jawedhabib.co.in/hairandbeautysalons-sl/")
#driver.maximize_window()
tableValues = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//tbody//tr[position()>2]")))
city = []
address = []
contactno = []
for tr in tableValues:
    #print(tr.get_attribute('textContent'))
    city.append(tr.find_element_by_xpath("./td[1]").get_attribute('textContent'))
    address.append(tr.find_element_by_xpath("./td[2]").get_attribute('textContent'))
    contactno.append(tr.find_element_by_xpath("./td[3]").get_attribute('textContent'))
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
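To get from those three lists to the CSV the question asks for, a short pandas step after the loop is one option. This is a sketch under the assumptions that pandas is installed and that the three lists line up row for row; the output file name is only an example.
import pandas as pd
# build a DataFrame from the three parallel lists collected above and save it
df = pd.DataFrame({"city": city, "address": address, "contact_number": contactno})
df.to_csv("salons.csv", index=False)  # example file name
print(df.head())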

Scrolling through pages with Python Selenium

I have written a Python script that aims to take data off a website, but I am unable to navigate and loop through the pages to collect the links. The website is https://www.shearman.com/people. The pagination HTML on the site looks like this:
ul class="results-pagination"
li class/a href onclick="PageRequest('2', event)"
When I run the query below, it says that the element is not attached to the page:
try:
    # this navigates to the next page
    driver.find_element_by_xpath('//ul[@class="results-pagination"]/li/[@onclick=">"]').click()
    time.sleep(5)
except NoSuchElementException:
    break
Any ideas what I'm doing wrong on this?
Many thanks in advance.
Chris
You can try this code:
browser.get("https://www.shearman.com/people")
wait = WebDriverWait(browser, 30)
main_tab = browser.current_window_handle
navigation_buttons = browser.find_elements_by_xpath('//ul[@class="results-pagination"]//descendant::a')
size = len(navigation_buttons)
print('this is the length of the list:', size)
i = 0
while i < size:
    ActionChains(browser).key_down(Keys.CONTROL).click(navigation_buttons[i]).key_up(Keys.CONTROL).perform()
    browser.switch_to_window(main_tab)
    i = i + 1
    if i >= size:
        break
Make sure to import these:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
Note this will open each link in a new tab. As per your requirement, you can also click the next-page button using this XPath: //ul[@class="results-pagination"]//descendant::a
If you want to open the links one by one in the same tab, then you will have to handle stale element references, because once you navigate away from the main page, all of its elements become stale.
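A minimal sketch of that same-tab pattern is below: rather than holding on to the link list across navigations, re-find the pagination links by index on every pass so the references are never stale. The XPath is the one from the answer above; everything else (the back-navigation and the loop structure) is an assumption for illustration, and whether browser.back() returns you to the list depends on how the site's PageRequest call behaves.
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(browser, 30)
pagination_xpath = '//ul[@class="results-pagination"]//descendant::a'
index = 0
while True:
    # re-locate the pagination links after every navigation so the elements are fresh
    links = wait.until(EC.presence_of_all_elements_located((By.XPATH, pagination_xpath)))
    if index >= len(links):
        break
    try:
        links[index].click()   # open the page in the same tab
        # ... scrape the page you landed on here ...
        browser.back()         # go back so the pagination list is present again
    except StaleElementReferenceException:
        continue               # the DOM changed under us; re-locate and retry this index
    index += 1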