I'm working on a web scraping project where I scrape products in a specific niche, and this is the URL I scrape data from:
https://www.aliexpress.com/af/category/200010058.html?categoryBrowse=y&origin=n&CatId=200010058&spm=a2g0o.home.108.18.650c2145CobYJm&catName=backpacks
I used Selenium as a downloader middleware to render JavaScript, and this is the code for that:
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def process_request(self, request, spider):
    self.driver.get(request.url)
    # Wait until at least one product card (class '_3t7zg') is clickable.
    WebDriverWait(self.driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, '_3t7zg'))
    )
    # self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(10)
    # Note: XPath attribute tests use @href, not #href.
    links = self.driver.find_elements(
        By.XPATH, '//div[1]/div/div[2]/div[2]/div/div[2]/a[@href]'
    )
The problem is that the webpage shows 60 products, but when I count the products I have scraped I find only 11. I don't know what the problem is.
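A likely cause, not confirmed here but consistent with the commented-out scrollTo line, is that AliExpress lazy-loads the product grid: only the first batch of cards exists in the DOM until you scroll, so a fixed sleep leaves most of the 60 products unrendered. A minimal sketch of incremental scrolling inside the same process_request method (the pause length is an assumption to tune):

import time

def process_request(self, request, spider):
    self.driver.get(request.url)
    # Scroll in steps so each lazily loaded batch of products renders
    # before the next scroll; a single jump can skip intermediate loads.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)  # assumed pause per step; tune for your connection
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height stopped growing: no more products to load
        last_height = new_height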
I am trying to scrape a website:

# Assuming the scrapy-puppeteer client package provides PuppeteerRequest
from scrapypuppeteer import PuppeteerRequest

def start_requests(self):
    yield PuppeteerRequest(
        'https://www.website.es/xxx/yyy/121111',
        callback=self.parse
    )

but I land on a captcha page with an "I am not a robot" checkbox. Is there a way to solve it without calling any third-party services?
Maybe by using this: https://github.com/dessant/buster
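One hedged idea, swapping PuppeteerRequest for a plain Selenium-driven Chrome: Selenium can load a packed extension into the browser it launches, so in principle you could install Buster locally and let it handle the challenge. The .crx path below is a placeholder, and success is not guaranteed:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder path to a locally downloaded copy of the Buster extension.
options.add_extension('/path/to/buster.crx')

driver = webdriver.Chrome(options=options)
driver.get('https://www.website.es/xxx/yyy/121111')
# Buster adds a solve button inside the reCAPTCHA challenge frame;
# it still has to be clicked, and it does not always succeed.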
How can I get the page source of the current page?
I call driver.get(link) and I am on the main page. Then I use Selenium to navigate to another page (by tag and XPath), and when I reach the right page I'd like to obtain its page source.
I tried driver.page_source, but I obtain the page source of the main page, not the current one.
import time

from selenium import webdriver

driver = webdriver.Chrome(ccc)  # ccc: path to the chromedriver binary
driver.get('https://aaa.com')
check1 = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/button')
check1.click()
time.sleep(1)
check2 = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div[1]/div/a')
check2.click()
And after check2.click() I am on a page with a new link (this link only works via the click, not directly). How can I get the page source for this new page?
I need it so I can switch from Selenium to Beautiful Soup.
I have used WebDriver to display the page source.
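A hedged guess at the cause: if check2.click() opens the target in a new tab or window, the driver keeps pointing at the original one, so driver.page_source keeps returning the main page. Continuing from the code above, switching window handles first should fix that:

# If the click opened a new tab or window, point the driver at it;
# otherwise driver.page_source keeps returning the original page.
driver.switch_to.window(driver.window_handles[-1])

# page_source is a property, not a method, and now reflects the new page.
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')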
Currently, I am trying to scrape a website that generates the captcha as plain text in the page source. I want to automate filling the form on that website using Selenium. But every time I scrape the website, the captcha on the Selenium page and the captcha on the scraped page are different. Can someone help me out with this?
website: https://webkiosk.jiit.ac.in
import requests
from bs4 import BeautifulSoup

results = requests.get("https://webkiosk.jiit.ac.in/")
soup = BeautifulSoup(results.text, 'html5lib')
links = soup.find_all("td")

# Finding the captcha from the cell text: the first 5-character
# cell after the "Parents" cell holds the captcha.
f = 0
p = None
for i in links:
    if "Parents" in i.text:
        f = 1
    if f and i.text != '*' and i.text != '' and len(i.text) == 5:
        p = i.text
        break
# The captcha is now stored in the variable p.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://webkiosk.jiit.ac.in/")
driver.find_element_by_name("MemberCode").send_keys('*****')
driver.find_element_by_name("DATE1").send_keys('******')
driver.find_element_by_name("Password101117").send_keys('#******')
driver.find_element_by_name("txtcap").send_keys(p)
# driver.find_element_by_name("BTNSubmit").click()
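The captchas differ because requests.get() and the Selenium browser are two separate HTTP sessions, and the server issues a fresh captcha to each. One fix is to parse the captcha out of the page Selenium itself loaded, via driver.page_source, so the value matches the form the browser is filling. A sketch reusing the table-scanning logic from above:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://webkiosk.jiit.ac.in/")

# Parse the captcha from the page this same browser session loaded,
# so the value matches the captcha the form expects.
soup = BeautifulSoup(driver.page_source, 'html5lib')
f = 0
p = None
for cell in soup.find_all("td"):
    if "Parents" in cell.text:
        f = 1
    if f and cell.text not in ('', '*') and len(cell.text) == 5:
        p = cell.text
        break

driver.find_element_by_name("txtcap").send_keys(p)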
I have a spider set up using link extractor rules. The spider crawls and scrapes the items that I expect, but it only follows the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors. There are a total of 50 pages to crawl via the 'Next' pagination. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = [some_base_url]

    rules = (
        # Note: LinkExtractor's allow= expects URL regexes; XPath
        # expressions belong in restrict_xpaths= instead. XPath
        # attribute tests also use @data-test, not #data-test.
        Rule(
            LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "product-title")]'),
            callback='parse_item',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "next")]'),
            follow=True,
        ),
    )

    def parse_item(self, response):
        # inspect_response(response, self)
        ...  # items are being scraped here
        return scraped_info
It feels like I may be missing a setting or something, as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears it may not have anything to do with my code: if I start with a different product page, I can get up to 8 pages scraped before the spider exits. I am not sure if it's the site I am crawling, or how to troubleshoot this.
EDIT 2
Troubleshooting it some more, it appears that my 'Next' link disappears from the web page. When I start on the first page, the pagination element is present to go to the next page. When I view the response for the next page, there are no products and no next-link element, so the spider thinks it is done. I have tried enabling cookies to see if the site requires a cookie in order to paginate; that doesn't have any effect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
I tried enabling cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests, but I get the same behavior when trying to navigate pages.
To check whether the site denies too many requests in a short time, you can add the following code to your spider (e.g. before your rules statement) and play around with the value. See also the excellent Scrapy documentation.
custom_settings = {
    'DOWNLOAD_DELAY': 0.4
}
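If a fixed delay doesn't help, Scrapy's AutoThrottle extension is a related option to try (an assumption worth testing, not something confirmed to fix this case); it adapts the delay to the server's response times instead of using one constant value:

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 1.0,  # initial download delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 10.0,   # ceiling if the server slows down
}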
Beautiful Soup is a Python library for pulling data out of HTML and XML files. I will use it to extract webpage data, but I couldn't find any way to click the buttons and anchor tags that, in my case, drive the page navigation. So do I have to use something else for this, or does Beautiful Soup have a capability I am not aware of?
Please advise me!
To answer your tags/comment, yes, you can use them together (Selenium and BeautifulSoup), and no, you can't directly use BeautifulSoup to execute events (clicking etc.). Although I myself haven't ever used them together in the same situation, a hypothetical situation could involve using Selenium to navigate to a target page via a certain path (i.e. click() these options and then click() the button to the next page), and then using BeautifulSoup to read the driver.page_source (where driver is the Selenium driver you created to 'drive' the browser). Since driver.page_source is the HTML of the page, you can use BeautifulSoup as you are used to, parsing out whatever information you need.
Simple example:
from bs4 import BeautifulSoup
from selenium import webdriver

# Create your driver
driver = webdriver.Firefox()

# Get a page
driver.get('http://news.ycombinator.com')

# Feed the source to BeautifulSoup (an explicit parser avoids a warning)
soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.title)  # <title>Hacker News</title>
The main idea is that anytime you need to read the source of a page, you can pass driver.page_source to BeautifulSoup in order to read whatever you want.