How to get around being blocked with "Scrapy" - scrapy

Background:
I am planning on buying a car, and want to monitor the prices.
I'd like to use Scrapy to do this for me. However, the site blocks my code from doing this.
MWE/Code:
#!/usr/bin/python3
# from bs4 import BeautifulSoup
import scrapy  # adding scrapy to our file

urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']

class HeadphoneSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"

    def start_requests(self):
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']  # list to enter our urls
        # urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon

    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")

def main():
    scraper()
Output:
...some stuff above it
2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
..some more stuff underneath
Question:
I just don't know how I can circumvent this so I'm allowed to parse the prices, kms, etc. It would make my life so much easier. How can I get past this block? FWIW, I also tried it with BeautifulSoup, which didn't work.

There are multiple ways to avoid being blocked by a site while scraping it:
Set ROBOTSTXT_OBEY = False
Increase DOWNLOAD_DELAY between your requests, e.g. 3 to 4 seconds, depending on the site's behaviour
Set CONCURRENT_REQUESTS to 1
Use a proxy, or a pool of proxies, via a custom proxy middleware
Carry the site's cookies in your requests so the site does not flag the traffic as bot behaviour
You can try these solutions one by one; a minimal settings sketch is shown below.
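For illustration, here is a minimal settings.py sketch covering the first points of the list; the concrete values are assumptions and should be tuned to the site's behaviour.
# settings.py -- a minimal sketch; values are illustrative assumptions
ROBOTSTXT_OBEY = False       # stop honouring robots.txt
DOWNLOAD_DELAY = 3           # seconds to wait between requests
CONCURRENT_REQUESTS = 1      # one request in flight at a time
COOKIES_ENABLED = True       # keep the site's cookies between requests
# a custom proxy middleware would be registered under DOWNLOADER_MIDDLEWARES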

Related

Request callback issue with infinite crawler

I am writing a Scrapy spider whose purpose is to make requests to a remote server in order to hydrate the cache. It's an infinite crawler because I need to make requests at regular intervals. I created an initial spider which generates the requests and hits the server; it worked fine, but now that I am making it run infinitely, I am not getting responses. I even tried to debug in the process_response middleware but couldn't get my spider that far. Here is a sketch of the code I am implementing:
def generate_requests(self, payloads):
    for payload in payloads:
        if payload:
            print(f'making request with payload {payload}')
            yield Request(url=Config.HOTEL_CACHE_AVAILABILITIES_URL, method='POST', headers=Config.HEADERS,
                          callback=self.parse, body=json.dumps(payload), dont_filter=True, priority=1)

def start_requests(self):
    crawler_config = CrawlerConfig()
    while True:
        if not self.city_scheduler:
            for location in crawler_config.locations:
                city_name = location.city_name
                ttl = crawler_config.get_city_ttl(city_name)
                payloads = crawler_config.payloads.generate_payloads(location)
                self.city_scheduler[location.city_name] = (datetime.now() + timedelta(minutes=ttl)).strftime("%Y-%m-%dT%H:%M:%S")
                yield from self.generate_requests(payloads)
Seems like Scrapy has some odd behavior with a while loop in start_requests. You can check a similar enhancement request on the Scrapy repo here.
Moving the while-loop logic into your parse method will solve this issue; a sketch of that idea follows.
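A minimal sketch, assuming the same CrawlerConfig, city_scheduler and generate_requests helpers as in the question (the due-time check is my assumption about how city_scheduler is meant to be used): start_requests seeds the first batch, and the parse callback re-schedules whatever is due, so there is no blocking while loop.
from datetime import datetime, timedelta

def start_requests(self):
    # seed the first batch only; no infinite loop here
    self.crawler_config = CrawlerConfig()
    yield from self.schedule_due_cities()

def schedule_due_cities(self):
    # hypothetical helper: yield requests for every city whose TTL has expired
    for location in self.crawler_config.locations:
        city_name = location.city_name
        due = self.city_scheduler.get(city_name)
        if due is None or datetime.now().strftime("%Y-%m-%dT%H:%M:%S") >= due:
            ttl = self.crawler_config.get_city_ttl(city_name)
            payloads = self.crawler_config.payloads.generate_payloads(location)
            self.city_scheduler[city_name] = (datetime.now() + timedelta(minutes=ttl)).strftime("%Y-%m-%dT%H:%M:%S")
            yield from self.generate_requests(payloads)

def parse(self, response):
    # ... handle the cache-hydration response ...
    # then re-schedule from the callback instead of looping in start_requests
    yield from self.schedule_due_cities()
If every city happens to be up to date at the same time, parse may yield nothing and the spider will go idle; a spider_idle handler can then be used to keep the crawl alive.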

Does each scrapy request go through middlewares?

I have something like this in my spider:
def some_parse(self, response):
    # ... other code here
    for link in extracted_links:
        log.info(link)
        yield scrapy.Request(link, callback=self.some_parse, method="GET")
Within my custom downloader middleware, I have something like this:
def process_request(self, request, spider):
    # do something
    log.info(request.url)
    request.headers.setdefault('User-Agent', "some randomly selected useragent")
I am getting several thousand log entries from some_parse but only a few hundred from process_request. Why is that? Doesn't each and every page request go through the middleware?
Two questions:
Did you let your scraper finish entirely?
There is a chance that most of the URLs were duplicates, which is why they did not go through the middleware but still show up in your spider's some_parse.
I am pretty sure your URLs were filtered out as duplicates.
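One quick way to test that hypothesis (a sketch, not from the original answer) is to make the duplicate filter log every request it drops, and to compare the dupefilter counters in the final crawl stats:
# settings.py -- log each request dropped by the duplicate filter
DUPEFILTER_DEBUG = True

# in the stats printed at the end of the crawl, compare for example:
#   'dupefilter/filtered'        requests dropped as duplicates
#   'downloader/request_count'   requests that actually reached the downloader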
I think I figured out the issue. I had:
def some_parse(self, response):                                             #1
    # ... other code here
    for link in extracted_links:                                            #2
        log.info(link)                                                      #3
        yield scrapy.Request(link, callback=self.some_parse, method="GET")  #4
If the depth limit is N and the response belongs to a URL at depth N, then none of the links it yields will pass through the middleware, even though each link has already been logged. Hence the discrepancy!
The correct way to log is:
def some_parse(self, response):
    # ... other code here
    log.info(response.url)
    for link in extracted_links:
        yield scrapy.Request(link, callback=self.some_parse, method="GET")
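For reference, the depth cut-off described above is enforced by Scrapy's DepthMiddleware; a short sketch of the relevant setting (the value 3 is only an example):
# settings.py -- requests yielded beyond this depth are dropped by DepthMiddleware
DEPTH_LIMIT = 3

# the final crawl stats record how deep the crawl actually went, e.g.:
#   'request_depth_max': 3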

Scrapy spider_idle signal - need to add requests with parse item callback

In my Scrapy spider I have overridden the start_requests() method in order to retrieve some additional urls from a database that represent items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

    # attempt to crawl orphaned items
    db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                         port=self.settings['AWS_RDS_PORT'],
                         user=self.settings['AWS_RDS_USER'],
                         passwd=self.settings['AWS_RDS_PASSWD'],
                         db=self.settings['AWS_RDS_DB'],
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True,
                         charset="utf8",)
    c = db.cursor()
    c.execute("""SELECT p.url FROM products p LEFT JOIN product_data pd ON p.id = pd.product_id AND pd.scrape_date = CURDATE() WHERE p.website_id = %s AND pd.id IS NULL""", (self.website_id,))
    while True:
        url = c.fetchone()
        if url is None:
            break
        # record orphaned product
        self.crawler.stats.inc_value('orphaned_count')
        yield Request(url['url'], callback=self.parse_item)
    db.close()
Unfortunately, it appears as though the crawler queues up these orphaned items during the rest of the crawl - so, in effect, too many are regarded as orphaned (because the crawler has not yet retrieved these items in the normal crawl, when the database query is executed).
I need this orphaned process to happen at the end of the crawl - so I believe I need to use the spider_idle signal.
However, my understanding is that I can't simply yield requests in my spider_idle method; instead I should use self.crawler.engine.crawl?
I need requests to be processed by my spider's parse_item() method (and for my configured middleware, extensions and pipelines to be obeyed). How can I achieve this?
The idle method that was connected to the idle signal (let's say it is called idle_method) should receive the spider as an argument, so you could do something like:
def idle_method(self, spider):
    self.crawler.engine.crawl(Request(url=myurl, callback=spider.parse_item), spider)
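For completeness, a sketch of how that handler could be connected and kept alive, assuming a hypothetical get_orphan_urls() helper that wraps the database query from the question:
import scrapy
from scrapy import signals, Request
from scrapy.exceptions import DontCloseSpider

class ProductSpider(scrapy.Spider):
    name = "products"
    orphans_scheduled = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run idle_method whenever the spider runs out of pending requests
        crawler.signals.connect(spider.idle_method, signal=signals.spider_idle)
        return spider

    def idle_method(self, spider):
        # schedule the orphan requests only once, at the end of the normal crawl
        if not self.orphans_scheduled:
            self.orphans_scheduled = True
            for url in self.get_orphan_urls():  # hypothetical helper wrapping the SQL above
                self.crawler.stats.inc_value('orphaned_count')
                self.crawler.engine.crawl(Request(url, callback=spider.parse_item), spider)
            # keep the spider open so the newly scheduled requests are processed
            raise DontCloseSpider
On the next idle signal, orphans_scheduled is already True, no exception is raised, and the spider closes normally.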

How to make Scrapy execute callbacks before the start_requests method finishes?

I have a large file of relative urls that I want to scrape with Scrapy, and I've written some code to read this file line-by-line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n", "").split(",")
            item = MyItem(data=inlist[0])
            request = scrapy.Request(
                url="http://foo.org/{0}".format(item["data"]),
                callback=self.parse_some_page
            )
            request.meta["item"] = item
            yield request

def parse_some_page(self, response):
    ...
    request = scrapy.Request(
        url="http://foo.org/bar",
        callback=self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
Every function that executes yield request will "block" until a result is available. So your parse_some_page() function yields a Scrapy Request object and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch, and hopefully that simulates a situation similar to the one you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy

class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'

    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create callback chain after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index + 1)
            # add callbacks and errorbacks as needed
            yield scrapy.Request(
                url=url,
                callback=deferred.callback)  # this func will start the callback chain AFTER a response is returned

    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body  # this will be passed to the next callback

    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function
        # or raise an error and start Errbacks chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet does is start the callback chain immediately after the Scrapy Response becomes available from the request. In Twisted, Deferreds start running their callback chains only after Deferred.callback(result) is executed with a value.
After a response is provided, the parse_some_page() function will run with the Response as an argument. What you will do is extract whatever you need here and pass it to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first, then executed callbacks, whereas my method uses Deferred.callback() as the callback for each request so that each response is processed immediately.
Hopefully this helps (and/or works).
References
Twisted Deferred Reference
Explanation of parse(): Briefly summarizes how yield/return affects parsing.
Non-Blocking Recipes (Klein): A blog post I wrote a while back on async callbacks in Klein/Twisted. Might be helpful to newbies.
PS
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.

Django-Rest-Framework: How to Document GET-less Endpoint?

My co-worker implemented an API that only allows GET requests with an ID parameter (so I can GET /foo/5 but can't GET /foo/). If I try to access the API's endpoint without providing an ID parameter, it (correctly) throws an unimplemented exception.
I want to fix this endpoint to show its documentation when viewed, without an ID, over the web. However, I still want it to throw an exception when that endpoint is accessed programmatically.
As I remember it, django-rest-framework is capable of distinguishing those two cases (via request headers), but I'm not sure how to define the endpoint such that it returns either documentation HTML or an exception as appropriate.
Can anyone help provide the pattern for this?
Based on the description, I would guess that the endpoint is a function-based view, registered on a route where it listens for GET requests WITH parameters. I would suggest registering another route where you listen for GET requests without parameters...
from rest_framework.decorators import api_view
from rest_framework.response import Response
from rest_framework import status

@api_view(['GET'])
def existing_get_item_api(request, item_id, *args, **kwargs):
    # query and return the item here ...
    pass

@api_view(['GET'])
def get_help(request, *args, **kwargs):
    # compose the help
    return Response(data=help, status=status.HTTP_200_OK)

# somewhere in urls.py
urlpatterns = [
    url(r'api/items/(?P<item_id>[0-9]+)/', existing_get_item_api),
    url(r'api/items/', get_help),
]
Let me know how is this working out for you.
We can use a ModelViewSet and routers for this implementation.
viewsets.py
from rest_framework import viewsets

class AccountViewSet(viewsets.ModelViewSet):
    """
    A simple ViewSet for viewing and editing accounts.
    """
    http_method_names = ['get']  # Django expects lowercase method names here
    queryset = Account.objects.all()
    serializer_class = AccountSerializer
routers.py
from rest_framework import routers
router = routers.SimpleRouter()
router.register(r'accounts', AccountViewSet)
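To actually serve the routed viewset, the generated URLs still need to be included in the project's urls.py; a minimal sketch, with the import path being an assumption about where routers.py lives:
# urls.py -- a sketch; adjust the import to the real location of routers.py
from django.urls import include, path
from .routers import router

urlpatterns = [
    path('api/', include(router.urls)),
]
With http_method_names restricted to 'get', the router then exposes both the list route (/api/accounts/) and the detail route (/api/accounts/<pk>/) as read-only endpoints.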