I am trying to scrape a website:
def start_requests(self):
    yield PuppeteerRequest(
        'https://www.website.es/xxx/yyy/121111',
        callback=self.parse
    )
but I land on a captcha page with an "I am not a robot" checkbox.
Is there a way to solve it without calling any third-party services?
Maybe by using this: https://github.com/dessant/buster
I'm working on a web scraping project where I'm scraping products related to a specific niche, and this is the URL I scrape data from:
https://www.aliexpress.com/af/category/200010058.html?categoryBrowse=y&origin=n&CatId=200010058&spm=a2g0o.home.108.18.650c2145CobYJm&catName=backpacks
I used Selenium as a middleware to render JavaScript, and this is the code for that:
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def process_request(self, request, spider):
    self.driver.get(request.url)
    # Wait until at least one product card is clickable
    WebDriverWait(self.driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, '_3t7zg'))
    )
    # self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(10)
    links = self.driver.find_elements("xpath", '//div[1]/div/div[2]/div[2]/div/div[2]/a[@href]')
The problem is that there are 60 products on the webpage, but when I count the products I have scraped I find only 11. I don't know what the problem is.
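The commented-out scrollTo line hints at the likely cause: category pages like this usually lazy-load products as you scroll, so a single page load only renders the first batch. Below is a minimal sketch of how the middleware could scroll down in steps before collecting the links. The class name and XPath are taken from the question; the pause length and the stop condition are assumptions to tune for the site.

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def process_request(self, request, spider):
    self.driver.get(request.url)
    WebDriverWait(self.driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, '_3t7zg'))
    )

    # Scroll down repeatedly so the lazy-loaded products get rendered.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)  # give the page time to load the next batch (tune as needed)
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no more content was loaded
        last_height = new_height

    links = self.driver.find_elements(
        "xpath", '//div[1]/div/div[2]/div[2]/div/div[2]/a[@href]'
    )

After the loop, links should contain all product anchors that the page renders; if the count is still short of 60, comparing the rendered HTML against the browser view will show whether the remaining products live behind pagination rather than lazy loading.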
I have this website that I need to scrape:
https://www.dawn.com
My goal is to scrape all news content containing the keyword "Pakistan".
So far, I can only scrape the content if I already have the URL. For example:
from newspaper import Article
import nltk

nltk.download('punkt')  # tokenizer required by newspaper's nlp()

url = 'https://www.dawn.com/news/1582311/who-chief-lauds-pakistan-for-suppressing-covid-19-while-keeping-economy-afloat'
article = Article(url)
article.download()
article.parse()
article.nlp()
print(article.summary)
With this code I would have to copy and paste every URL by hand, and that is too much to do manually. Do you have any idea how to automate this?
A better approach is to go to https://www.dawn.com/pakistan, download the page (.html), then scrape all the news links and content from it, and filter by keyword afterwards.
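As a rough illustration of that suggestion, the sketch below collects article links from the https://www.dawn.com/pakistan section page with requests and BeautifulSoup, then feeds each one to newspaper's Article. The assumption that article URLs contain '/news/' comes from the example URL above and may need adjusting to the page's actual markup.

import requests
from bs4 import BeautifulSoup
from newspaper import Article

section_url = 'https://www.dawn.com/pakistan'
html = requests.get(section_url, timeout=30).text
soup = BeautifulSoup(html, 'lxml')

# Collect links that look like article pages (assumption: they contain '/news/').
article_urls = {
    a['href'] for a in soup.find_all('a', href=True)
    if '/news/' in a['href'] and a['href'].startswith('https://www.dawn.com')
}

for url in article_urls:
    article = Article(url)
    article.download()
    article.parse()
    if 'Pakistan' in article.text:  # keep only articles mentioning the keyword
        article.nlp()
        print(url)
        print(article.summary)

The keyword filter here simply checks the parsed article text; swapping it for a case-insensitive match or for newspaper's keyword list is straightforward if needed.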
https://www.linkedin.com/learning/topics/business
This is the URL I used to scrape the required content, such as "Windows Quick Tips":
import urllib.request

import bs4 as bs

source = urllib.request.urlopen("https://www.linkedin.com/learning/topics/business")
source = source.read()
soup = bs.BeautifulSoup(source, 'lxml')
f = soup.body
r = soup.find_all('div')
for x in r:
    d = x.find('main')  # look for a <main> element inside each <div>
I want to scrape content such as "Windows Quick Tips" from the webpage.
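One thing worth checking first is what the fetched HTML actually contains: LinkedIn Learning pages are largely rendered by JavaScript and may also require a login, so the titles you see in the browser are often not present in the raw response that urlopen returns. A minimal diagnostic sketch, assuming nothing about the page's markup, is to dump the headings and link texts from the downloaded HTML and see whether the course titles appear at all; if they do not, a JavaScript-capable tool such as Selenium (as used for the AliExpress question above) is needed.

import urllib.request

import bs4 as bs

url = "https://www.linkedin.com/learning/topics/business"
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'lxml')

# Print every heading and link text so you can see what the raw HTML really contains.
for tag in soup.find_all(['h1', 'h2', 'h3', 'a']):
    text = tag.get_text(strip=True)
    if text:
        print(tag.name, ':', text)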
I have a spider set up using link extractor rules. The spider crawls and scrapes the items that I expect, but it only follows the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors. There are a total of 50 pages to crawl via the 'Next' pagination. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = [some_base_url]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "product-title")]'), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "next")]'), follow=True),
    )

    def parse_item(self, response):
        # inspect_response(response, self)
        ...  # items are being scraped here
        return scraped_info
It feels like I may be missing a setting or something, as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears it may not have anything to do with my code: if I start with a different product page, I can get up to 8 pages scraped before the spider exits. I'm not sure whether it's the site I am crawling or how to troubleshoot this.
EDIT 2
Troubleshooting it some more, it appears that my 'Next' link disappears from the web page. When I start on the first page, the pagination element is present to go to the next page. When I view the response for the next page, there are no products and no next-link element, so the spider thinks it is done. I have tried enabling cookies to see if the site requires a cookie in order to paginate; that doesn't have any effect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
I tried enabling cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests, but I get the same behavior when trying to navigate pages.
To check whether the site denies too many requests in a short time, you can add the following code to your spider (e.g. before your rules statement) and play around with the value. See also the excellent documentation.
custom_settings = {
    'DOWNLOAD_DELAY': 0.4,
}
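In context, that looks roughly like the sketch below. The AUTOTHROTTLE_ENABLED line is an optional extra (a standard Scrapy setting, not something from the question) that lets Scrapy adapt the delay based on server responsiveness instead of using only a fixed value; the start URL is a placeholder.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['https://www.example.com/products']  # placeholder

    # Spider-level settings override the project settings for this spider only.
    custom_settings = {
        'DOWNLOAD_DELAY': 0.4,         # fixed pause between requests (seconds)
        'AUTOTHROTTLE_ENABLED': True,  # optional: adjust the delay automatically
    }

    rules = (
        # ... your existing Rule(LinkExtractor(...)) definitions ...
    )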
I am traversing this URL. On it, the javascript:ctrl.set_pageReload(1) function makes an AJAX call which then loads the page data. How can I write my Rule(LinkExtractor()) to traverse this, or is there some other way?
What is AJAX? It is simply a request to a link with the GET or POST method.
You can check it in the Inspect Element view.
Click on the button you are talking about, then see where the AJAX request goes.
Also, instead of extracting URLs via Rule(LinkExtractor()), remove start_urls and the def parse() method and do this:
def start_requests(self):
    yield Request(url=URL_HERE, callback=self.parse_detail_page)
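To make that concrete: once the browser's DevTools Network tab shows you the URL the AJAX call actually hits, you can request that endpoint directly instead of trying to follow the javascript: link. The sketch below uses a hypothetical endpoint and page parameter; substitute whatever URL and query string the Network tab shows for your site.

import scrapy

class AjaxSpider(scrapy.Spider):
    name = 'ajax_spider'

    def start_requests(self):
        # Hypothetical endpoint discovered in DevTools -> Network when clicking the button.
        # Replace the URL and parameters with what you actually observe.
        for page in range(1, 6):
            yield scrapy.Request(
                url=f'https://www.example.com/ajax/listing?page={page}',
                callback=self.parse_detail_page,
            )

    def parse_detail_page(self, response):
        # The AJAX response is often JSON or an HTML fragment; parse it accordingly.
        yield {'url': response.url, 'length': len(response.text)}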