How does Scrapy filter crawled URLs? - scrapy

I want to know how Scrapy filters crawled URLs. Does it store all crawled URLs in something like a crawled_urls_list, and when it gets a new URL, look it up in that list to check whether the URL already exists?
Where is the code for this filtering part of CrawlSpider (/path/to/scrapy/contrib/spiders/crawl.py)?
Thanks a lot!

By default, Scrapy keeps a fingerprint of every request it has seen. These fingerprints are held in memory in a Python set and also appended to a file called requests.seen in the directory defined by the JOBDIR setting.
If you restart Scrapy, that file is reloaded into the Python set.
The class that controls this is in scrapy.dupefilter.
You can override this class if you need different behaviour.
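For example, a minimal sketch of a custom filter (assuming a hypothetical myproject/dupefilters.py module; in current releases the default filter lives in scrapy.dupefilters) could subclass the default and just log what it drops:

import logging
from scrapy.dupefilters import RFPDupeFilter

logger = logging.getLogger(__name__)

class LoggingDupeFilter(RFPDupeFilter):
    """Keeps the default fingerprint-set behaviour but logs every filtered URL."""

    def request_seen(self, request):
        seen = super().request_seen(request)
        if seen:
            logger.debug("Duplicate request filtered: %s", request.url)
        return seen

You would then point the DUPEFILTER_CLASS setting at this class, e.g. DUPEFILTER_CLASS = "myproject.dupefilters.LoggingDupeFilter".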

Related

Scraping Blogs - avoid already scraped items by checking urls from json/csv in advance

I'd like to scrape news pages / blogs (anything which contains new information on a daily basis).
My crawler works fine and does everything I kindly asked it to do.
But I cannot find a proper solution for having it ignore already scraped URLs (or items, to keep it more general) and only add new URLs/items to an already existing JSON/CSV file.
I've seen many solutions here for checking whether an item exists in a CSV file, but none of these "solutions" really worked.
Scrapy DeltaFetch apparently cannot be installed on my system... I get errors, and all the hints, like e.g. $ sudo pip install bsddb3, upgrade this and update that, etc., do not do the trick. (I've been trying for 3 hours now and am fed up with hunting for a fix for a package that hasn't been updated since 2017.)
I hope that you have a handy and practical solution.
Thank you very much in advance!
Best regards!
An option could be a custom downloader middleware with the following:
A process_response method that puts the URLs you crawled in a database.
A process_request method that checks whether the URL is present in the database. If it is, you raise an IgnoreRequest so the request does not go through anymore.
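A minimal sketch of such a middleware, assuming a local SQLite database and a hypothetical myproject/middlewares.py module:

import sqlite3
from scrapy.exceptions import IgnoreRequest

class SeenUrlsMiddleware:
    """Downloader middleware that skips URLs recorded in a local SQLite table."""

    def __init__(self, db_path="seen_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")

    def process_request(self, request, spider):
        row = self.conn.execute(
            "SELECT 1 FROM seen_urls WHERE url = ?", (request.url,)
        ).fetchone()
        if row:
            raise IgnoreRequest(f"Already scraped: {request.url}")

    def process_response(self, request, response, spider):
        # Record successfully fetched pages so later runs skip them.
        if response.status == 200:
            self.conn.execute(
                "INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (response.url,)
            )
            self.conn.commit()
        return response

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.SeenUrlsMiddleware": 543}.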

Scrapy shell gives an output of an empty list even though the XPath is correct in Chrome. Why?

Executed in the Scrapy shell:
url = "https://www.daraz.com.np/smartphones/?spm=a2a0e.11779170.cate_1.1.287d2d2b2cP9ar"
fetch(url)
r = scrapy.Request(url = url)
fetch(r)
response.xpath("//div[@class='ant-col-20 ant-col-push-4 c1z9Ut']/div[@class='c1_t2i']/div[@class='c2prKC']/div/div/div/div[@class='c16H9d']/a/text()").getall()
##NOTE##
There is no tbody tag in the XPath.
Why does it output an empty list in Scrapy even though it matches 40 text nodes in Chrome?
It's because the website is heavily JavaScript oriented. That means the content is loaded dynamically: the page invokes HTTP requests as it loads, and the data is not hard-coded into the HTML. So when you use the Scrapy shell, that content is not in the HTML you download.
A couple of suggestions:
Try to re-engineer the HTTP requests. That is, the JavaScript invokes HTTP requests, so if you can mimic those requests you can get the data you want. You will need to use Chrome dev tools or similar to see how the requests are made. This is the cleanest and most concise way to get the data. All other options will slow the spider down and are more brittle.
Scrapy-splash - this pre-renders the DOM of the page and allows you to access the HTML you desire (a rough sketch follows below).
Scrapy-selenium - a downloader middleware that handles requests with Selenium. It doesn't have the full functionality of the Selenium package, but it can render the DOM so you can get the data you require.
Embed Selenium into the Scrapy spider. This is the worst choice and should really only be used as a last resort.
Please see the docs on dynamic content for a bit more detail here
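If you go the scrapy-splash route, a rough sketch, assuming a Splash instance running on localhost:8050 and the middleware settings described in the scrapy-splash README, could look like this:

import scrapy
from scrapy_splash import SplashRequest

class DarazSpider(scrapy.Spider):
    name = "daraz"

    def start_requests(self):
        url = "https://www.daraz.com.np/smartphones/"
        # Render the page in Splash and give the JavaScript time to populate the listing.
        yield SplashRequest(url, callback=self.parse, args={"wait": 2})

    def parse(self, response):
        # The class name below is taken from the question; it is generated by the
        # site's front-end build, so it may change at any time.
        for title in response.xpath("//div[@class='c16H9d']/a/text()").getall():
            yield {"title": title}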

Scrapy: How to stop CrawlSpider after 100 requests

I would like to limit the number of pages CrawlSpider visits on a website.
How can I stop the Scrapy CrawlSpider after 100 requests?
I believe you can use the closespider extension for that, with the CLOSESPIDER_PAGECOUNT setting. According to the docs:
... specifies the maximum number of responses to crawl. If the spider
crawls more than that, the spider will be closed with the reason
closespider_pagecount
All you would need to do is set in your settings.py:
CLOSESPIDER_PAGECOUNT = 100
If this doesn't suit your needs, another approach could be writing your own extension that uses Scrapy's stats module to keep track of the number of requests.
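If you do write your own extension, a rough sketch (counting scheduled requests via a signal handler rather than reading the stats, and using a hypothetical REQUEST_LIMIT setting and myproject/extensions.py module) could look like this:

from scrapy import signals
from scrapy.exceptions import NotConfigured

class RequestLimit:
    """Closes the spider after a configurable number of scheduled requests."""

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint("REQUEST_LIMIT", 0)  # hypothetical setting name
        if not limit:
            raise NotConfigured
        ext = cls(crawler, limit)
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        return ext

    def request_scheduled(self, request, spider):
        self.count += 1
        if self.count >= self.limit:
            self.crawler.engine.close_spider(spider, reason="request_limit_reached")

Enable it with EXTENSIONS = {"myproject.extensions.RequestLimit": 500} and REQUEST_LIMIT = 100 in settings.py.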

Follow only child links using Scrapy

I'm new to Scrapy and I'm not sure how to tell it to follow only links that are subpages of the current URL. For example, if you are here:
www.test.com/abc/def
then I want scrapy to follow:
www.test.com/abc/def/ghi
www.test.com/abc/def/jkl
www.test.com/abc/def/*
but not:
www.test.com/abc/*
www.test.com/*
or any other domain for that matter.
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
Write a spider deriving from BaseSpider. In the BaseSpider parse callback you need to return the requests you want to follow. Just make sure each request you generate is of the form you want, i.e. that the URL extracted from the response is a child of the current URL (which will be the response URL). Then build Request objects and yield them.
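A rough sketch of that idea, using scrapy.Spider (the current name for BaseSpider) and the example URL from the question:

import scrapy

class ChildLinksSpider(scrapy.Spider):
    name = "child_links"
    start_urls = ["http://www.test.com/abc/def"]

    def parse(self, response):
        base = response.url.rstrip("/") + "/"
        for href in response.xpath("//a/@href").getall():
            url = response.urljoin(href)
            # Follow only links that live under the current page's path.
            if url.startswith(base):
                yield scrapy.Request(url, callback=self.parse)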

Modify crawled URL before indexing it

I am using Nutch 1.4. I want to manipulate the crawled URL before indexing it.
For example, if my URL is http://xyz.com/home/xyz.aspx then I want to modify it to http://xyz.com/index.aspx?role=xyz, and only the latter should be indexed in Solr. The reason is that I don't want to expose the first URL; the second URL will ultimately redirect to the same page.
Is there a provision in Nutch to manipulate crawled URLs before indexing them to Solr?
There is no out-of-the-box way to modify the value fed to Solr unless you write a custom plugin to do so.
However, this can easily be handled on the client side before the results are displayed to the user.
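As a purely illustrative sketch of that client-side approach (the field name and the rewrite rule are hypothetical, based on the example URLs in the question):

import re

def rewrite_url(url):
    # Map http://xyz.com/home/<role>.aspx to http://xyz.com/index.aspx?role=<role>
    match = re.match(r"http://xyz\.com/home/(\w+)\.aspx", url)
    if match:
        return "http://xyz.com/index.aspx?role=" + match.group(1)
    return url

def prepare_results(solr_docs):
    # Rewrite the indexed URL in each Solr document before it is shown to the user.
    return [dict(doc, url=rewrite_url(doc.get("url", ""))) for doc in solr_docs]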