Unable to paginate with selenium-scrapy, only extracting data for first page - selenium

I am scraping a website for the most recent customer ratings, which are spread across several pages.
The problem is that I can interact with the "sort by" option and select "most recent" using Selenium, and scrape the data for the first page using Scrapy. However, I am unable to extract the data for the other pages; the Selenium WebDriver somehow does not render the next page. My intention is to automate the data scraping.
I am new to web scraping. A snippet of the code is attached below (some information has been removed for confidentiality).
import scrapy
import selenium.webdriver as webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait,Select
import time
from selenium.webdriver.support import expected_conditions as EC
from scrapy import Selector
from selenium.webdriver.edge.options import Options
from scrapy.utils.project import get_project_settings
class ABC(scrapy.Spider):
    # "........."

    def start_requests(self):
        # " ...... "
        yield scrapy.Request(url)

    def parse(self, response):
        settings = get_project_settings()
        driver_path = settings.get('EDGE_DRIVER_PATH')
        options = Options()
        options.add_argument("headless")
        ser = Service(driver_path)
        driver = webdriver.Edge(service=ser, options=options)
        driver.get(response.url)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "sort-order-dropdown")))
        element_dropdown = driver.find_element(By.ID, "sort-order-dropdown")
        select = Select(element_dropdown)
        select.select_by_value("recent")
        time.sleep(5)
        for review in response.css('[data-hook="review"]'):
            res = {
                "rating": review.css('[class="a-icon-alt"]::text').get(),
            }
            yield res
        next_page = response.xpath('//a[text()="Next page"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
        driver.quit()

It looks like you're using plain Scrapy and Selenium rather than scrapy_selenium (I don't see any SeleniumRequest in your code).
Your current spider works like this:
Get page using Scrapy
Get the same page using Selenium webdriver
Perform some actions using Selenium
Parse Scrapy response (for rating and next_page)
As you can see, you never use or parse the Selenium result: the ratings and the next-page link are extracted from the original Scrapy response, which still has the default sort order and only ever reflects the first request.
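For reference, here is a minimal sketch of one way to close that gap, assuming the selectors and IDs from your spider are correct: build a Scrapy Selector from driver.page_source once the sort has been applied, and extract the ratings and the next-page link from that rendered HTML instead of from response. This would replace the second half of your parse() method (Selector is already imported in your snippet):
# Hedged sketch: parse the Selenium-rendered page rather than the original Scrapy response
sel = Selector(text=driver.page_source)  # HTML after "most recent" has been selected
for review in sel.css('[data-hook="review"]'):
    yield {
        "rating": review.css('[class="a-icon-alt"]::text').get(),
    }
next_page = sel.xpath('//a[text()="Next page"]/@href').get()
if next_page:
    yield scrapy.Request(response.urljoin(next_page))
driver.quit()
Alternatively, you could drive the pagination entirely through Selenium (click the "Next page" link and re-read driver.page_source in a loop), or switch to scrapy_selenium and let SeleniumRequest hand parse() an already-rendered response.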

Related

There is an element, so why am I getting a NoSuchElementException?

problem:
Given the following code:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
# extract the data
elem = browser.find_element_by_xpath("//*[@id='faRVArcVR1a2']/table[2]/tbody/tr[2]/td[6]")
print(elem.text)
I wrote this code to extract finance data.
First, I navigate to the page that has the data I want.
Then I copy the XPath using the Chrome browser's "Copy XPath" function.
But although the text is there, I get a NoSuchElementException.
Why does this happen?
try to fix:
At first I thought, "is this happening because of a delay?"
Although there is almost no delay on my computer, I tried to fix it anyway.
I added some imports and changed the 'elem' part:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
# extract the data
elem = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="faRVArcVR1a2"]/table[2]/tbody/tr[2]/td[6]')))
print(elem.text)
but as a result, I only get a TimeoutException.
I don't know why this happens. Help please, and thank you!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Chrome()
browser.get('https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=004000')
# move to my goal
browser.find_element_by_link_text("재무분석").click()
browser.find_element_by_link_text("재무상태표").click()
elementXpath = '//table[@summary="IFRS연결 연간 재무 정보를 제공합니다."][2]/tbody/tr[2]/td[6]'
# extract the data
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, elementXpath)))
# Wait for the table to load
time.sleep(1)
elem = browser.find_element(by=By.XPATH, value=elementXpath)
print(elem.text)
There were several problems:
The ID of the div which wraps the table ("faRVArcVR1a2") changes every time you load the page, so locating the element by that ID is not reliable. I changed it so the element is found via the table's summary attribute instead. (That changing ID is also why the original WebDriverWait timed out: its XPath never matched.)
Here the wait is only used to confirm the element is present; the cell itself is then fetched with find_element once the wait succeeds.
Even after you have waited for the table to appear, you have to wait an additional second so that all cells of the table load; otherwise you would get an empty string.
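If the extra fixed one-second sleep feels fragile, a hedged alternative (using the same locator as above) is to wait until the cell actually contains text; by default WebDriverWait ignores NoSuchElementException while polling, so this also covers the case where the cell has not been created yet:
# Wait until the target cell exists and has non-empty text, instead of a fixed sleep
WebDriverWait(browser, 10).until(
    lambda d: d.find_element(By.XPATH, elementXpath).text.strip() != ""
)
print(browser.find_element(By.XPATH, elementXpath).text)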

Selenium fails to load elements, despite EC, waits, and scrolling attempts

With Selenium (3.141), BeautifulSoup (4.7.9), and Python (3.79), I'm trying to scrape which streaming, rental, and buying options are available for a given movie/show. I've spent hours trying to solve this, so any help would be appreciated. Apologies for the poor formatting, in terms of mixing in comments and prior attempts.
Example Link: https://www.justwatch.com/us/tv-show/24
The desired outcome is a BeautifulSoup element that I can then parse (e.g., which streaming services have it, how many seasons are available, etc.), which has 3 elements (as of now): Hulu, IMDb TV, and DirecTV.
I've tried numerous variations, but I only get one of the 3 streaming services for the example link, and even then it's not a consistent result. Often I get an empty object.
Some of the things I've tried include waiting for an expected condition (presence or visibility) and explicitly using sleep() from the time library. I'm using a Mac (but running Linux via a USB), so there is no "PAGE DOWN" key on the physical keyboard. With the Keys module I've tried Control+Arrow Down, Page Down, and Space (the space bar), but on this particular web page they don't work. However, if I'm browsing normally, Control+Arrow Down and the space bar do scroll the desired section into view. As far as I know there is no Fn+Arrow Down option in Keys, but that's another way I can scroll normally.
I've run both headless and regular options to try to debug, as well as trying both Firefox and Chrome drivers.
Here's my code:
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
firefox_options = Options()
firefox_options.add_argument('--enable-javascript') # double-checking to make sure that javascript is enabled
firefox_options.add_argument('--headless')
firefox_driver_path = 'geckodriver'
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=firefox_options)
url_link = 'https://www.justwatch.com/us/tv-show/24'
driver.get(url_link) # initial page
cookies = driver.get_cookies()
Examples of things I've tried around this part of the code: various time.sleep(3) and driver.implicitly_wait(3) calls, plus:
webdriver.ActionChains(driver).key_down(Keys.CONTROL).key_down(Keys.ARROW_DOWN).perform()
webdriver.ActionChains(driver).key_down(Keys.SPACE).perform()
This code yields a timeout error when used
stream_results = WebDriverWait(driver, 15)
stream_results.until(EC.presence_of_element_located(
(By.CLASS_NAME, "price-comparison__grid__row price-comparison__grid__row--stream")))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser') # 'lxml' didn't work either
Here's the code for getting the HTML related to the streaming services. I've also tried to grab the HTML at various levels, IDs, and classes of the tree, but the code just isn't there:
stream_row = soup.find('div', attrs={'class':'price-comparison__grid__row price-comparison__grid__row--stream'})
stream_row_holder = soup.find('div', attrs={'class':'price-comparison__grid__row__holder'})
stream_items = stream_row_holder\
.find_all('div', attrs={'class':'price-comparison__grid__row__element__icon'})
driver.quit()
I'm not sure if you are saying your code works in some cases or not at all, but I use Chrome and the four find_all() lines at the end all produce results. If this isn't what you mean, let me know. The one thing you may be missing is a time.sleep() that is long enough. That could be the only difference...
Note that you need chromedriver to run this code, but perhaps you have Chrome and can download chromedriver.exe.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://www.justwatch.com/us/tv-show/24'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(class_="price-comparison__grid__row__price")
soup.find_all(class_="price-comparison__grid__row__holder")
soup.find_all(class_="price-comparison__grid__row__element__icon")
soup.find_all(class_="price-comparison__grid__row--stream")
This is the output from the last line:
[<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title price-comparison__promoted__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>,
<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="IMDb TV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/134049674/s100" title="IMDb TV"/><div class="price-comparison__grid__row__price"> 8 Seasons <!-- --></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="DIRECTV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/158260222/s100" title="DIRECTV"/><div class="price-comparison__grid__row__price"> 1 Season <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>]
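If it helps, here is a hedged sketch (assuming the class names in that output are still current) of pulling the provider names and season counts out of the soup built above; each provider icon's title attribute carries the service name, and the sibling price div carries the season count:
# Extract provider name and season count from each streaming row
rows = soup.find_all(class_="price-comparison__grid__row--stream")
for row in rows:
    for icon in row.find_all("img", class_="jw-provider-icon"):
        provider = icon.get("title")  # e.g. "Hulu", "IMDb TV", "DIRECTV"
        price_div = icon.find_next_sibling("div", class_="price-comparison__grid__row__price")
        seasons = price_div.get_text(" ", strip=True) if price_div else None  # e.g. "9 Seasons HD"
        print(provider, seasons)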

Selenium can't find search element inside form

I'm trying to use Selenium to perform searches in LexisNexis and I can't get it to find the search box.
I've tried the find_element_by_* methods with every possible attribute and I only get the "NoSuchElementException: Message: no such element: Unable to locate element:" error every time.
See the screenshot of the inspection tab -- the highlighted part is the element I need.
My code:
from selenium import webdriver
import numpy as np
import pandas as pd
searchTerms = r'something'
url = r'https://www.lexisnexis.com/uk/legal/news' # this is the page after login - not including the code for login here.
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)
I tried everything:
browser.find_element_by_id('search-query')
browser.find_element_by_xpath('//*[@id="search-query"]')
browser.find_element_by_xpath('/html/body/div/header/div/form/div[2]/input')
etc..
Nothing works. Any suggestions?
It could be that your site is taking too long to load; in such cases you can use waits to avoid synchronization issues.
wait = WebDriverWait(driver, 10)
inputBox = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='search-query']")))
Note: add the imports below to your solution
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
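Putting the answer together, a minimal sketch of the whole flow (assuming the input's id really is search-query and that the form is not inside an iframe, which would need a switch_to.frame() call first):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.get('https://www.lexisnexis.com/uk/legal/news')  # page after login

# wait until the search box is clickable, then type the search terms
wait = WebDriverWait(browser, 10)
inputBox = wait.until(EC.element_to_be_clickable((By.ID, 'search-query')))
inputBox.send_keys('something')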

Web Scraping app using Scrapy / Selenium generates Error: "ModuleNotFoundError 'selenium'"

Good morning!
I have recently started learning Python and moved onto applying the little I know to create the monstrosity seen below.
In brief:
I am attempting to scrape SEC Edgar (https://www.sec.gov/edgar/searchedgar/cik.htm) for CIK codes of companies which I want to study in more detail (for now just 1 company to see if it's the right approach).
To scrape the CIK code, I created a Scrapy spider, imported Selenium, and created 3 functions: the first to insert the company name into the input field, the second to activate the "submit" button, and finally a function to scrape the CIK code once the submit is activated and return the item.
Apart from adding the item to items.py, I haven't changed the middlewares or settings.
For some reason, I am getting ModuleNotFoundError for 'selenium', although I have installed the packages and imported selenium & webdriver along with everything else.
I have tried to mess around with indentation and rephrased the code but achieved no improvement.
import selenium
from selenium import webdriver
import scrapy
from ..items import Sec1Item
from scrapy import Selector
class SecSpSpider(scrapy.Spider):
    name = 'SEC_sp'
    start_urls = ['http://https://www.sec.gov/edgar/searchedgar/cik.htm/']

    def parse(self, response):
        company_name = 'INOGEN INC'
        return scrapy.FormRequest.from_response(response, formdata={
            'company': company_name
        }, callback=self.start_requests())

    def start_requests(self):
        driver = webdriver.Chrome()
        driver.get(self.start_urls)
        while True:
            next_url = driver.find_element_by_css_selector(
                '.search-button'
            )
            try:
                self.parse(driver.page_source)
                next_url.click()
            except:
                break
        driver.close()

    def parse_page(self, response):
        items = Sec1Item()
        CIK_code = response.css('a::text').extract()
        items["CIK Code: "] = Sec1Item
        yield items
I can't seem to get past the import selenium error, so I am not sure how much of the rest of my spider needs adjusting.
Error message:
"File/Users/user1/PycharmProjects/Scraper/SEC_1/SEC_1/spiders/SEC_sp.py", line 1, in <module>
import selenium
ModuleNotFoundError: No module named 'selenium'
Thank you for any assistance and help!
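Since the traceback shows the spider failing on import selenium, a quick hedged diagnostic (not part of the original post) is to check whether selenium is installed for the same interpreter that runs Scrapy; run this from the same terminal or virtualenv you use for scrapy crawl:
import sys
print(sys.executable)  # which Python interpreter is actually in use
try:
    import selenium
    print("selenium", selenium.__version__, "is installed at", selenium.__file__)
except ModuleNotFoundError:
    print("selenium is missing here; install it with: python -m pip install selenium")
If the interpreter shown is not the one you installed selenium into (for example a separate PyCharm virtualenv), installing selenium into that environment should clear the ModuleNotFoundError.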

selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:

I'm trying to automatically generate lots of users on the webpage kahoot.it using Selenium, to make them appear in front of the class. However, I get this error message when trying to access the inputSession element (where you type the game ID to enter the game):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.kahoot.it")
gameID = driver.find_element_by_id("inputSession")
username = driver.find_element_by_id("username")
gameID.send_keys("53384")
This is the error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:
{"method":"id","selector":"inputSession"}
It looks like the web page takes time to load, so the web element hasn't been rendered when the lookup runs. You can either use @shri's code above or just add these two statements right below the line driver = webdriver.Firefox():
driver.maximize_window() # For maximizing window
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
It could be a race condition where find_element executes before the element is present on the page. Take a look at the wait timeout documentation. Here is an example from the docs:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "myDynamicElement"))
)
finally:
driver.quit()
In my case, the error was caused by the element I was looking for being inside an iframe. This meant I had to change frame before looking for the element:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.google.co.uk/maps")
frame_0 = driver.find_element_by_class_name('widget-consent-frame')
driver.switch_to.frame(frame_0)
agree_btn_0 = driver.find_element_by_id('introAgreeButton')
agree_btn_0.click()
Reddit source
You can also use below as an alternative to the above two solutions:
import time
time.sleep(30)
This worked for me (the try/finally approach didn't; it kept hitting the finally block and browser.close()):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('mywebsite.com')
username = None
while(username == None):
username = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "username"))
)
username.send_keys('myusername@email.com')
It seems that your browser did not read the HTML tags properly; use a delay so the page loads first, and then get the tags from the page:
driver = webdriver.Chrome('./chromedriver.exe')
# load the page
driver.get('https://www.instagram.com/accounts/login/')
# use delay function to get all tags
driver.implicitly_wait(20)
# access tag
driver.find_element_by_name('username').send_keys(self.username)
Also, for some, it may be due to a new tab opening when clicking the button (not in this question particularly). Then you can switch tabs with this command:
driver.switch_to.window(driver.window_handles[1]) #for switching to second tab
I had the same problem as you and this solution saved me:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
It just means the function is executing before the button can be clicked. Example solution:
from time import sleep
# load the page first and then pause
sleep(3)
# pauses execution for 3 seconds before the next line runs