For my data project, I am trying to scrape a website with Selenium. It loads new articles by incrementing the page number:
https://geschenkly.de/page/1/, then 2, 3, 4, and so on.
Starting with the first page, the site is displayed in the Chrome WebDriver, but whenever I try to find an element, the result is either empty or the element doesn't exist:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import json
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--headless")
chrome_options.add_argument(f'user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
chrome_options.add_argument("window-size=1920,1080")
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, chrome_options=chrome_options)
chrome_options = Options()
#page = 1
driver.get('https://geschenkly.de/page/1/')
wait = wait(driver, 60)
elements = driver.find_elements(By.CLASS_NAME, "woocommerce-LoopProduct-link woocommerce-loop-product__link")
The elements with this class are the links to the individual article pages. I can find them when inspecting the page, but in Selenium, elements is an empty list.
woocommerce-LoopProduct-link woocommerce-loop-product__link is actually two class names. By.CLASS_NAME accepts only a single class name, so it cannot locate such elements.
To locate an element by multiple class names, use By.CSS_SELECTOR or By.XPATH.
You also need to actually use the expected conditions to wait for the elements; creating the wait object without calling until() has no effect.
Your locator could be improved as well.
This would work better:
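driver.get('https://geschenkly.de/page/1/')
# WebDriverWait is imported as `wait` in the question; use another name for the instance so the alias is not shadowed
waiter = wait(driver, 60)
# visibility_of_all_elements_located returns a list of matches; visibility_of_element_located returns only the first one
elements = waiter.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".woocommerce-LoopProduct-link.woocommerce-loop-product__link")))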
However, the locator above also matches irrelevant elements.
The following locator returns half as many elements and looks more correct:
driver.get('https://geschenkly.de/page/1/')
waiter = wait(driver, 60)  # `wait` is the WebDriverWait alias from the question's imports
elements = waiter.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.thumb-wrapper.zoom a.woocommerce-LoopProduct-link.woocommerce-loop-product__link")))
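The rule described above (several class names must be joined into one compound CSS selector) and the page pattern from the question can be sketched in plain Python. The helper names css_from_classes and page_urls are made up for this illustration and are not part of Selenium; the sketch only builds strings and does not start a browser.

```python
def css_from_classes(class_attr, tag=""):
    """Turn a space-separated class attribute into a compound CSS selector."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    # ".a.b" matches elements that carry BOTH classes a and b
    return tag + "".join("." + name for name in classes)


def page_urls(base, first, last):
    """Yield URLs of the form <base>/page/<n>/ (pattern taken from the question)."""
    for n in range(first, last + 1):
        yield f"{base.rstrip('/')}/page/{n}/"


selector = css_from_classes(
    "woocommerce-LoopProduct-link woocommerce-loop-product__link", tag="a"
)
print(selector)
print(list(page_urls("https://geschenkly.de", 1, 3)))
```

The resulting selector string can then be passed to By.CSS_SELECTOR inside the wait shown above, once per URL produced by page_urls.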
I am trying to scrape all company names from the Inc. 5000 site ("https://www.inc.com/inc5000/2021"). The problem is that the company names are rendered with JavaScript. I have tried both selenium and requests_html to render the site, but when I fetch the page source I still get JavaScript. This is what I tried. I am new to web scraping, so it is possible that I am making some foolish mistake. Please guide me.
Here is my code.
...
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.get("https://www.inc.com/inc5000/2021")
data=driver.page_source
print(data)
...
You could give the page some time to render, or use Selenium's waits:
...
import time
driver.get('https://www.inc.com/inc5000/2021')
time.sleep(5)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a warning
for e in soup.select('.company'):
    print(e.text)
...
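The difference between a fixed time.sleep(5) and a wait is that a wait polls a condition and returns as soon as it holds. The following is a simplified, browser-free sketch of that idea. It is not Selenium's WebDriverWait implementation; wait_until and the rendered callable are hypothetical names used only for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Demo with a plain callable that becomes truthy on its third call,
# standing in for "the .company rows have been rendered".
calls = {"n": 0}


def rendered():
    calls["n"] += 1
    return ["company row"] if calls["n"] >= 3 else []


rows = wait_until(rendered, timeout=2.0, poll=0.01)
print(rows, "after", calls["n"], "polls")
```

With a real driver, the condition could be something like lambda: driver.find_elements(By.CLASS_NAME, "company"), which returns an empty (falsy) list until the rows exist.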
You don't need Beautiful Soup for this; Selenium alone is enough:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.inc.com/inc5000/2021")
companies = [e.text for e in driver.find_elements(By.CLASS_NAME, "company")]
This only gives you the elements rendered in the viewport. To get the rest, you need to scroll.
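One common way to do the scrolling is to scroll to the bottom repeatedly until the document height stops growing. Below is a minimal sketch under that assumption; whether this particular site loads rows on scroll in this way is not verified here. scroll_until_stable is a hypothetical helper name, and FakeDriver only stands in for a real WebDriver so the sketch runs without a browser. The only WebDriver call used is execute_script.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops changing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stays the same."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # any other script is treated as a scroll that may load more content
        self.index = min(self.index + 1, len(self.heights) - 1)


rounds_used = scroll_until_stable(FakeDriver(), pause=0)
print(rounds_used)
```

After the loop finishes on a real driver, driver.find_elements(By.CLASS_NAME, "company") can be called once to collect everything that was loaded.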
I am working on a web scraping project at the moment. The website I am trying to get data from is not the easiest to work with. I am using Selenium, and have worked my way through most of the items I want to select.
The website: http://crashinformationky.org/AdvancedSearch
I can get the website to open and select different properties. But when I select Data and then try to click on "Today" to change the date, nothing works.
I have tried XPath, CSS selector, link text, and partial link text; nothing works.
Here is my most recent attempt.
WebDriverWait(driver, timeout=5).until(driver.find_element(By.CSS_SELECTOR, '#QueryPanel-cond-2 > div:nth-child(4) > a'))
date_one = driver.find_element(By.CSS_SELECTOR, '#QueryPanel-cond-2 > div:nth-child(4) > a')
date_one.click()
date_one_enter = driver.find_element(By.XPATH,'//*[@id="dp1647307835636"]')
date_one_enter.send_keys('01/01/2016')
Welcome Kyle!
As a personal preference, I like to use full XPaths when selecting items with Selenium. I was able to click on TODAY, clear the field, send a new date, and hit ENTER. See below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--enable-logging')
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
url = "http://crashinformationky.org/AdvancedSearch"
driver.get(url)
WebDriverWait(driver, 30).until(ec.visibility_of_element_located((By.XPATH, "/html/body/div[1]/div[3]/form/div[2]/div/div[1]/div[2]/div[2]/div/div[3]/a"))).click()
WebDriverWait(driver, 30).until(ec.visibility_of_element_located((By.XPATH, "/html/body/div[10]/div/div[3]"))).click()
WebDriverWait(driver, 30).until(ec.visibility_of_element_located((By.XPATH, "/html/body/div[1]/div[3]/form/div[2]/div/div[1]/div[2]/div[2]/div/div[2]/div[2]/div/div[3]/a"))).click()
field = driver.find_element(By.XPATH, "/html/body/div[1]/div[3]/form/div[2]/div/div[1]/div[2]/div[2]/div/div[2]/div[2]/div/div[3]/input")
field.send_keys(Keys.COMMAND + "a")
field.send_keys(Keys.DELETE)
field.send_keys('01/01/2016')
field.send_keys(Keys.ENTER)
Note that I am using a Mac; on Windows, Keys.COMMAND would be Keys.CONTROL.
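The platform difference can be handled in code instead of by editing the script. A minimal sketch: select_all_modifier is a hypothetical helper that returns the name of the attribute on Selenium's Keys class, so it runs without Selenium installed. Treating every non-macOS platform as CONTROL is an assumption of this sketch.

```python
import sys


def select_all_modifier(platform=sys.platform):
    """Return the Keys attribute name used for 'select all' on this platform."""
    # sys.platform is "darwin" on macOS; everything else is assumed to use Ctrl
    return "COMMAND" if platform == "darwin" else "CONTROL"


print(select_all_modifier("darwin"))
print(select_all_modifier("win32"))
```

In the script above, the key could then be looked up with getattr(Keys, select_all_modifier()) and sent as field.send_keys(getattr(Keys, select_all_modifier()) + "a").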
I've been trying to access the HTML code of cookie banners using Selenium. For some websites, I can see the cookie banner HTML in the Firefox Web Inspector, but I cannot access it via Selenium.
For example, https://faz.net. Here, driver.page_source does not contain the HTML code of the cookie banner, and I also can't access its elements via driver.find_elements (e.g. the "ZUSTIMMEN" button; "zustimmen" means "to accept").
What I've tried so far:
from selenium import webdriver
driver = webdriver.Firefox()
driver.implicitly_wait(20)
driver.get("https://faz.net")
print(driver.page_source) # page source does not contain the button "ZUSTIMMEN"
print(driver.find_elements_by_xpath('//button[text()="ZUSTIMMEN"]'))
driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[text()=ZUSTIMMEN"]'))))
What am I doing wrong?
The ZUSTIMMEN button is inside an iframe. You need to switch the driver's focus to the iframe, like below:
driver.get("https://faz.net")
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[id^='sp_message_iframe']")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[title='ZUSTIMMEN']"))).click()
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Once you are done with the iframe, you can switch back to the default content like this:
driver.switch_to.default_content()
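Forgetting to switch back is an easy mistake, so the switch in and the switch out can be paired in a context manager. This is a sketch, not part of Selenium: inside_frame is a hypothetical helper, it uses driver.switch_to.frame directly rather than the frame_to_be_available_and_switch_to_it wait shown above, and FakeDriver only records calls so the example runs without a browser. The frame reference "sp_message_iframe" is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, and always switch back to the top-level document."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()


class _FakeSwitchTo:
    """Stand-in for driver.switch_to that just records what was called."""

    def __init__(self, log):
        self.log = log

    def frame(self, ref):
        self.log.append(("frame", ref))

    def default_content(self):
        self.log.append(("default_content",))


class FakeDriver:
    def __init__(self):
        self.log = []
        self.switch_to = _FakeSwitchTo(self.log)


driver = FakeDriver()
with inside_frame(driver, "sp_message_iframe"):
    driver.log.append(("click", "ZUSTIMMEN"))  # work done inside the iframe
print(driver.log)
```

With a real driver, the frame argument would be the iframe WebElement (or its name or index), and the body of the with block would hold the click on the button.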
That element is inside an iframe.
You have to switch to the iframe in order to access the element.
Like this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("https://faz.net")
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[contains(@title,'SP')]")))
Now you can click the cookie button to close the banner:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title='ZUSTIMMEN']"))).click()
I'm trying to scrape some of the JS-loaded data from https://surviv.io/stats/player787, such as the total number of kills. Could someone tell me how I can scrape the JS-loaded data with Selenium? Thanks.
EDIT: Here is some of the code
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')
b = browser.find_element_by_tag_name('tr')
The 'tr' that contains the data I want is not grabbed by Selenium.
To get the kill count, use WebDriverWait with visibility_of_all_elements_located:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')
allkills = WebDriverWait(browser,20).until(EC.visibility_of_all_elements_located((By.XPATH,"//div[@class='card-mode-stat-name' and text()='KILLS']/following-sibling::div[1]")))
for item in allkills:
    print(item.text)
The reason it's not finding the element is that the page isn't fully rendered yet. You can add a wait with Selenium so that it does not move on until the specified element has been rendered.
Also, if the data is in a <table> tag, let pandas do the parsing for you. It uses an HTML parser such as lxml or BeautifulSoup under the hood to pull out the <table>, <th>, <tr>, and <td> tags, and returns them as a list of DataFrames once you have the rendered HTML source:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
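import pandas as pd
from io import StringIO
# Selenium 3 style; in Selenium 4 pass the path via Service(...) and service=
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get('https://surviv.io/stats/player787')
delay = 3 # seconds
WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'player-stats-overview')))
# Wrap the HTML in StringIO: newer pandas versions warn when given a literal HTML string
df = pd.read_html(StringIO(browser.page_source))[0]
print(df.loc[0,'Kills'])
browser.quit()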
Output:
18884
print (df)
   Wins  Kills  Games  K/G
0   638  18884   8896  2.1
You could avoid the overhead of a browser and simply mimic the POST request the page makes.
import requests
headers = {'content-type': 'application/json; charset=UTF-8'}
data = {"slug":"player787","interval":"all","mapIdFilter":"-1"}
r = requests.post('https://surviv.io/api/user_stats', headers=headers, json=data)
data = r.json()
desired_stats = ['wins', 'kills', 'games', 'kpg']
for stat in desired_stats:
    print(stat, ': ' , data[stat])
For the OP:
The payload is visible in the browser's network tab when you click on the XHR request for the URL in my answer (you need to scroll down to see the payload info).
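If the API response is missing one of the expected keys, data[stat] raises a KeyError. A small sketch of a more forgiving lookup follows. pick_stats is a hypothetical helper, and sample_response is made-up data shaped like the keys used above (the numbers are borrowed from the pandas output earlier in this thread), not an actual response from the API.

```python
def pick_stats(payload, wanted):
    """Return only the wanted keys; keys missing from the payload map to None."""
    return {key: payload.get(key) for key in wanted}


# Hypothetical payload for illustration only; the real response may differ.
sample_response = {"slug": "player787", "wins": 638, "kills": 18884, "games": 8896}

stats = pick_stats(sample_response, ["wins", "kills", "games", "kpg"])
for name, value in stats.items():
    print(name, ":", value)
```

With the real request, pick_stats(r.json(), desired_stats) would replace the sample dictionary.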
To scrape the values 652, 19152, 8926, 2.1, etc. from JS-loaded pages, you have to use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://surviv.io/stats/player787')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.player-stats-overview td")))])
Using XPATH:
driver.get('https://surviv.io/stats/player787')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='player-stats-overview']//td")))])
Console Output:
['652', '19152', '8926', '2.1']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
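The innerHTML values come back as strings. A short sketch of turning them into labelled numbers is below. The column order Wins, Kills, Games, K/G is an assumption taken from the pandas output shown in another answer in this thread, and to_number is a hypothetical helper.

```python
def to_number(text):
    """Convert a cell string to int when possible, otherwise to float."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)


headers = ["Wins", "Kills", "Games", "K/G"]   # assumed column order
cells = ['652', '19152', '8926', '2.1']       # console output from above

stats = dict(zip(headers, map(to_number, cells)))
print(stats)
```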
I'm trying to use Selenium to load the next page of results by clicking the Load More button on this site.
However, the HTML source of the page loaded by Selenium does not contain the actual products that one can see when browsing.
Here is my code:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
#browser = webdriver.Firefox()#Chrome('./chromedriver.exe')
URL = "https://thekrazycouponlady.com/coupons-for/costco"
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//button[@class = "kcl-btn ng-scope"]/span'
caps = DesiredCapabilities.PHANTOMJS
# driver = webdriver.Chrome(r'C:\Python3\selenium\webdriver\chromedriver_win32\chromedriver.exe')
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
driver = webdriver.PhantomJS(r'C:\Python3\selenium\webdriver\phantomjs-2.1.1-windows\bin\phantomjs.exe',service_log_path=os.path.devnull,desired_capabilities=caps)
driver.get(URL)
while True:
    try:
        time.sleep(20)
        html = driver.page_source.encode('utf-8')
        print(html)
        loadMoreButton = driver.find_element_by_xpath(LOAD_MORE_BUTTON_XPATH)
        loadMoreButton.click()
    except Exception as e:
        print (e)
        break
print ("Complete")
driver.quit()
Not sure if I can attach sample html file here for reference.
Anyway, what is the problem, and how do I load exactly the same page with Selenium as I do in the browser?
It might be due to the use of PhantomJS, which isn't maintained any more and has been deprecated since Selenium 3.8.1. Use headless Chrome instead.
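from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=options)  # CHROMEDRIVER_PATH: path to your chromedriver binary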
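Independently of the browser choice, the Load More loop from the question can be written without relying on a broad except to end it. The sketch below assumes the button disappears once everything is loaded. click_until_gone is a hypothetical helper, and FakeButton stands in for a real element so the example runs without a browser.

```python
def click_until_gone(find_button, max_clicks=50, after_click=None):
    """Click the button returned by find_button() until it returns None."""
    clicks = 0
    while clicks < max_clicks:  # upper bound guards against an endless loop
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
        if after_click is not None:
            after_click()  # e.g. wait for the new items to load
    return clicks


class FakeButton:
    """Stand-in for a 'Load More' element that vanishes after three clicks."""

    def __init__(self, remaining):
        self.remaining = remaining

    def click(self):
        self.remaining -= 1


button = FakeButton(3)
total_clicks = click_until_gone(lambda: button if button.remaining > 0 else None)
print(total_clicks)
```

With a real driver, find_button could be lambda: next(iter(driver.find_elements(By.XPATH, LOAD_MORE_BUTTON_XPATH)), None), since find_elements returns an empty list instead of raising when nothing matches.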