Scrapy - target specified URLs only

I'm using Scrapy to browse and collect data, but the spider is crawling lots of unwanted pages. I'd prefer the spider to start from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it still crawls a whole series of other pages as well. Any suggestions on how to approach this?
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
)
Thanks!

Your extractor is extracting every link because it doesn't have any filtering arguments set.
If you take a look at the official documentation, you'll notice that Scrapy link extractors have lots of parameters that you can set to customize which links they extract.
Example:
rules = (
    # only links to specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links that match a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions (given without the leading dot)
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
You can also set allowed domains for your spider if you don't want it to wander off somewhere:
class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
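To get a feel for what such filters let through, here is a rough standard-library sketch. It only approximates the behaviour (Scrapy applies allow patterns as a regex search on the URL) and, unlike the example rules above, it combines all three filters in one check. The would_follow helper and the sample URLs are made up for illustration and are not part of Scrapy.

```python
import re
from urllib.parse import urlparse

# Illustrative values mirroring the example rules; not Scrapy objects.
ALLOW = re.compile(r".+?/page\d+\.html")
DENY_EXTENSIONS = {".pdf"}
ALLOWED_DOMAINS = {"scrapy.org", "blog.scrapy.org"}


def would_follow(url):
    """Approximate check: domain allowed, extension not denied, regex matches."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    in_domain = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    ext_ok = not any(parsed.path.lower().endswith(e) for e in DENY_EXTENSIONS)
    return in_domain and ext_ok and ALLOW.search(url) is not None


for candidate in (
    "https://scrapy.org/docs/page2.html",
    "https://example.com/page2.html",
    "https://scrapy.org/files/page3.pdf",
    "https://scrapy.org/about.html",
):
    print(candidate, would_follow(candidate))
```

Only the first URL passes all three checks; the others fail on the domain, the extension, and the regex respectively.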

Related

How to scrape multiple json pages with CrawlSpider?

I have a json file that I want to scrape: https://www.website.com/api/list?limit=50&page=1
I can use 'scrapy.Spider' to crawl all the pages no problem, but if it's possible I prefer to do it with 'CrawlSpider'.
I tried to use:
start_urls = ['https://www.website.com']
rules = (
    Rule(LinkExtractor(allow=r'/api/list\?.+page=\d+'), callback='parse_page', follow=True),
)
and (just to see if it's even getting the first page):
start_urls = ['https://www.website.com']
rules = (
    Rule(LinkExtractor(allow=r'/api/list'), callback='parse_page', follow=True),
)
and neither of them worked.
Is there a way to do it with 'CrawlSpider'?
This is not possible with CrawlSpider.
The LinkExtractor used to process CrawlSpider rules can only extract links from HTML responses (from a and area tags), not from a JSON API.
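With a plain scrapy.Spider, the pagination has to be driven by hand. One possible building block is a helper that computes the next page URL from the current one. The sketch below uses only the standard library; next_page_url is a hypothetical name, and it assumes the API paginates through a numeric page query parameter as in the URL from the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(url):
    """Return the same URL with its 'page' query parameter incremented by one."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Assumption: a missing 'page' parameter means we are on page 1.
    query["page"] = str(int(query.get("page", "1")) + 1)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(next_page_url("https://www.website.com/api/list?limit=50&page=1"))
```

Inside the spider callback one could then load the body with json.loads(response.text) and, as long as the page still contains results, yield a scrapy.Request for next_page_url(response.url). How an empty page is detected depends on the API's response shape, which the question does not show.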

Link extractor is not able to get the paths beyond a certain path

I need a bit of help and guidance with Scrapy.
My start URL is: http://lighting.philips.co.uk/prof/
I have pasted my code below. It is able to follow the links / paths up to the URL below, but not beyond it. I need to go to each product's page listed under that path. On the "productsinfamily" page the specific products are listed (perhaps by JavaScript). My crawler is not able to reach those individual product pages.
http://www.lighting.philips.co.uk/prof/led-lamps-and-tubes/led-lamps/corepro-ledbulb/productsinfamily/
Below is the code for the crawl spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductSearchSpider(CrawlSpider):
    name = "product_search"
    allowed_domains = ["lighting.philips.co.uk"]
    start_urls = ['http://lighting.philips.co.uk/prof/']
    rules = (
        Rule(
            LinkExtractor(allow=(r'^https?://www.lighting.philips.co.uk/prof/led-lamps-and-tubes/.*',)),
            callback='parse_page',
            follow=True,
        ),
    )

    def parse_page(self, response):
        yield {'URL': response.url}
You are right that the links are defined in JavaScript.
If you take a look at the HTML source, on line 3790 you can see a variable named d75products being created. This is later used to populate a template and display the products.
The way I'd approach this would be to extract this data from the page source and load it with the json module. Once you have the data, you can do whatever you want with it.
Another way would be to use something (e.g. a browser) to execute the JavaScript and then parse the resulting HTML. I think that's unnecessary and overcomplicated, though.
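A minimal sketch of the first approach, assuming the variable is assigned a JSON-compatible literal directly in an inline script. The extract_js_variable helper is a hypothetical name and the sample HTML is invented for the demonstration; the real page's data will look different.

```python
import json
import re


def extract_js_variable(html, name):
    """Find `name = <JSON literal>` in the page source and decode the literal."""
    match = re.search(r"\b" + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (e.g. the trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Invented sample markup, only to show the mechanics.
sample_html = (
    '<script>var d75products = '
    '[{"name": "Example product", "url": "/prof/example"}];</script>'
)
products = extract_js_variable(sample_html, "d75products")
print(products[0]["url"])
```

In a spider this would be called with response.text. If the literal is not valid JSON (single quotes, trailing commas, unquoted keys), json raises a JSONDecodeError and the text needs cleaning up first.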

How do scrapy crawlspider rules work?

I have an issue with CrawlSpider rules.
Here is my rule definition:
rules = (
    Rule(LinkExtractor(allow=(r"/liste/.*/department\.aspx\?categoryId=\d+", ))),
    Rule(LinkExtractor(allow=(r"/liste/.*/department\.aspx\?categoryId=.*&pn=\d+", ))),
    Rule(LinkExtractor(allow=(r'/liste/.*/productDetails\.aspx\?productId=.*&categoryId=.*', )), callback='parse_page'),
)
What I expect: the first rule should find the category links and request them; the second rule should then find the matching links on those pages and request them too; finally, the third rule should find the product links and call parse_page as the callback.
But it does not work as I expect. I cannot control the spider: on each run it scrapes different pages rather than all of them. I want to scrape every page that matches the rules.
How should I define my rules to achieve this?

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes the links to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop on the links:
cr = csv.reader(open("linksToCrawl.csv","rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate"+''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):
    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs containing "&pageinfo=\d", so only pages with a matching URL are passed to the callback. The start URLs are not extracted by any rule; their responses go to parse_start_url(), which is why pointing it at parse_items gets them processed. Otherwise, you need to change the allow parameter for the URLs without pageinfo to be processed.
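The mismatch is easy to confirm outside Scrapy by running the same pattern over the two URLs from the question. This is only a sketch: rule_matches is a hypothetical helper that mimics the regex search a link extractor performs on each URL.

```python
import re

ALLOW = r"&pageinfo=\d"

first_page = "http://www.zap.co.il/ratemodel.aspx?modelid=835959"
next_page = "http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2"


def rule_matches(url, pattern=ALLOW):
    """True if the allow pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(rule_matches(first_page), rule_matches(next_page))
```

The first URL never matches, so the rule's callback only ever sees the pages that carry the pageinfo argument.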

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum that I'm trying to crawl has a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on but I would like to know what is the best choice to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (always redirects me to another website), so here's a best effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        # callback and follow are arguments of Rule, not of the link extractor
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page that differ in only one or two query arguments (say, order=asc), you can specify deny=(...) in the link extractor constructor to filter them out.