Using multiple spiders in the same project in Scrapy

I want to know if it is possible to use multiple spiders together within the same project. I actually need 2 spiders: the first one gathers the links that the second spider should scrape. They both work on the same website, so the domain is the same. Is it possible? If so, can you give me an example?
Thanks

Maybe this is what you're looking for: instead of two separate spiders, chain two callbacks in a single spider.

from scrapy import Request

def parse(self, response):
    # parse the links (aka your first spider)
    for link in response.xpath('//XPATH'):
        yield Request(link.extract(), callback=self.parse_link)

def parse_link(self, response):
    # continue parsing (aka your second spider)
    pass

Hope this helps you :)
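If it helps, here is the same idea as a complete spider; the spider name, start URL, selectors, and item fields below are placeholders, not taken from your site:

import scrapy


class LinkThenDetailSpider(scrapy.Spider):
    # Placeholder name, start URL, and selectors; replace with your own.
    name = 'link_then_detail'
    start_urls = ['http://www.example.com/listing']

    def parse(self, response):
        # First stage: collect the links (what your "first spider" would do).
        for href in response.xpath('//a[@class="item"]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_link)

    def parse_link(self, response):
        # Second stage: scrape each linked page (what your "second spider" would do).
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }

The first callback only yields requests, the second only yields items, so the two stages stay as cleanly separated as two spiders would be.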

Follow links to an end page in Scrapy - tutorials?

Not sure if this is the correct place to ask such a general question.
But I can't seem to find any examples or tutorials on using Scrapy to scrape a website by following links down to an end page that holds the product details I want to extract.
I start from the main web page, where I can scrape the href attributes from the tags, but how do I then follow each link? Each link takes me to another page with more href links, which, if followed again, ultimately lead to the product page itself where the data I want to extract resides.
Is this some kind of recursion? Sorry, but I am new to this. Does anyone know of a good tutorial or example? I find the official documentation a bit difficult to follow.
You can find a couple of examples here: https://github.com/scrapy/quotesbot
Also, here is an example that parses books on http://books.toscrape.com/ :
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'toscrape.com'
    start_urls = ['http://books.toscrape.com/']

    rules = (
        # Links that match the allow / deny arguments will be processed in parse_book.
        Rule(LinkExtractor(allow=('/catalogue/',), deny=('/category/', '/page')), callback='parse_book'),
        # These links will be processed in the CrawlSpider default callback, which looks for new links.
        Rule(LinkExtractor(allow=('/page',))),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('div.product_main>h1::text').extract_first(),
            'price': response.css('p.price_color::text').extract_first(),
        }
When you use a CrawlSpider like in this example, Scrapy automatically extracts the links for you and iterates through each of them until no more can be found.
I used the Scrapy documentation to do this.
You can see my example here:
https://github.com/dbaleeds/ScrapyQuoteExtractor/blob/master/quotes/spiders/quotesBrainy.py
This is the same thing you are trying to do: reading links from a page, following each link, then reading the data from the resulting pages.
def parse(self, response) reads the links page, and def parse_item(self, response) parses the data within each page reached from the links followed above.
I would suggest implementing this to see how it works, then using it as a base to build your own project on.
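In case the link above moves, here is a minimal sketch of the same parse / parse_item pattern; the quotes site is only an illustrative target and the selectors are assumptions you would adapt:

import scrapy


class QuoteLinksSpider(scrapy.Spider):
    # Sketch of the link-following pattern; selectors assume the layout of
    # http://quotes.toscrape.com/ and may need adjusting.
    name = 'quote_links'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Read the listing page and follow each author link.
        for href in response.css('div.quote a[href*="author"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # Parse the data on the page the link led to.
        yield {
            'name': response.css('h3.author-title::text').extract_first(),
            'born': response.css('span.author-born-date::text').extract_first(),
        }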

Does Scrapy work on .cfm files?

So far my experience with Scrapy spiders has been with focused scraping. In other words, I first do a manual keyword search on a target website, which returns a URL containing the keywords; an example is http://www.simplyhired.com/search?q=Anesthesiologist. This link lets my spider "see" what I get in a browser.
I have now noticed this method doesn't work on some websites, such as this one: http://www.physicianjobboard.com/. Keyword searching works in a browser, but it only produces a generic link, http://www.mdjobsite.com/Index2.cfm?Page=JobsSearchResults. This generic link points to a .cfm file and does not directly tell my spiders which keywords I am interested in.
One inefficient approach would be scraping all the posts from this website and filtering out the ones I need. Is there another method to let my spiders see what I get in my browser and perform a focused scrape? My guess is to have the spider send a request that mimics the keyword search and then analyze the response page. I have zero experience with this. Could anyone give some hints on whether my guess is correct?
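For illustration, the guess above would look roughly like the sketch below using scrapy.FormRequest; the form field name ('Keywords') and the result selector are hypothetical and would need to be checked against the actual page in your browser's developer tools:

import scrapy


class KeywordSearchSpider(scrapy.Spider):
    # Sketch: submit the site's search form instead of guessing the results URL.
    # The field name 'Keywords' and the 'a.job-link' selector are assumptions.
    name = 'keyword_search'
    start_urls = ['http://www.physicianjobboard.com/']

    def parse(self, response):
        # Fill in and submit the search form found on the page.
        # If the page has several forms, you may also need formname or formnumber.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'Keywords': 'Anesthesiologist'},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # Pull job links out of the search-results page.
        for href in response.css('a.job-link::attr(href)').extract():
            yield {'url': response.urljoin(href)}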

Which is the best way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between 3 approaches to using Scrapy to crawl 1000 sites.
For example, say I want to scrape 1000 photo sites. They almost all have the same structure: one kind of photo list page and another kind of full-size photo page, but the HTML of these list and photo description pages will not all be the same.
Another example: I want to scrape 1000 WordPress blogs, only the blogs' articles.
The first is exploring the entire 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but having one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there another, better approach I've missed?
I had 90 sites to pull from, so it wasn't a great option to create one crawler per site. The idea was to be able to run in parallel. I also split the work so that similar page formats were handled in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL extractor. This extracts all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - Fetch details.
This reads from the URL file and extracts the item details.
This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler worked on a specific page format, there were quite a few functions I could reuse.
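As a rough sketch of that split (not the poster's actual code; the spider names, file name, and selectors are placeholders):

import json

import scrapy


class UrlExtractorSpider(scrapy.Spider):
    # Crawler 1: dump detail-page URLs from listing pages into a file,
    # e.g. by running it with a JSON-lines feed export to urls.jl (assumed workflow).
    name = 'url_extractor'
    start_urls = ['http://www.example.com/listing']

    def parse(self, response):
        for href in response.css('a.detail::attr(href)').extract():
            yield {'url': response.urljoin(href)}


class FetchDetailsSpider(scrapy.Spider):
    # Crawler 2: read the URL file produced by crawler 1 and scrape each page.
    name = 'fetch_details'

    def start_requests(self):
        with open('urls.jl') as f:
            for line in f:
                yield scrapy.Request(json.loads(line)['url'], callback=self.parse_detail)

    def parse_detail(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }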

Follow only child links using Scrapy

I'm new to Scrapy and I'm not sure how to tell it to only follow links that are subpages of the current url. For example, if you are here:
www.test.com/abc/def
then I want scrapy to follow:
www.test.com/abc/def/ghi
www.test.com/abc/def/jkl
www.test.com/abc/def/*
but not:
www.test.com/abc/*
www.test.com/*
or any other domain for that matter.
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
Write a spider deriving from BaseSpider. In the BaseSpider parse callback you need to return the requests you want to follow. Just make sure each request you generate has the form you want, i.e. that the URL extracted from the response is a child of the current URL (which will be response.url). Then build a Request object for it and yield it.
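A minimal sketch of that check (using scrapy.Spider, the current name for BaseSpider; the start URL and link selector are placeholders):

import scrapy


class ChildLinksSpider(scrapy.Spider):
    name = 'child_links'
    start_urls = ['http://www.test.com/abc/def']

    def parse(self, response):
        base = response.url.rstrip('/') + '/'
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            # Only follow links that live under the current URL.
            if url.startswith(base):
                yield scrapy.Request(url, callback=self.parse)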

Use Scrapy to cut down on Piracy

I am new to using Scrapy and I know very little of the Python language. So far, I have installed Scrapy and gone through a few tutorials. Since then, I have been trying to find a way to search many sites for the same data. My goal is to use Scrapy to find links to "posts" that match a few search criteria. As an example, I would like to search sites A, B, and C. For each site, I would like to see if they have a "post" about app X, Y, or Z. If they have any "posts" on X, Y, or Z, I would like it to grab the link to that post. If it would be easier, it could instead scan each post for our company name: instead of X, Y, Z it would search the contents of each "post" for [Example Company name]. The reason I am doing it this way is so that the JSON that is created only has links to the "posts", so that we can review them and contact the website if need be.
I am on Ubuntu 10.12 and I have been able to scrape the sites we are targeting, but I have not been able to narrow the JSON down to the needed info, so currently we still have to go through hundreds of links, which is what we wanted to avoid. We are getting so many links because all the tutorials I have found scrape a specific HTML tag; I want to search the tag's contents to see if they contain any part of our app titles or package names.
With the code below, it displays the post info, but now we have to pick out the links from the JSON. That saves time but is still not really what we want. Part of the problem, I think, is that I am not referencing or calling things correctly. Please give me any help that you can; I have spent hours trying to figure this out.
posts = hxs.select("//div[#class='post']")
items = []
for post in posts:
item = ScrapySampleItem()
item["title"] = post.select("div[#class='bodytext']/h2/a/text()").extract()
item["link"] = post.select("div[#class='bodytext']/h2/a/#href").extract()
item["content"] = post.select("div[#class='bodytext']/p/text()").extract()
items.append(item)
for item in items:
yield item
I want to use this to cut down on piracy of our Android apps. If I can have this go out and search the piracy sites we care about, I can then email the site or hosting company with all of the links that we want removed. Under copyright law they have to comply, but they require that we link them to every infringing "post", which is why app developers normally do not bother with this kind of thing: the sites host hundreds of apps, so finding the links to your apps takes many hours of work.
Thank you in advance for any help you can offer. You will be helping out many app developers in the long run!
Grady
Your XPath selectors are absolute. They have to be relative to the previous selector (hence the leading . below):
posts = hxs.select("//div[@class='post']")
for post in posts:
    item = ScrapySampleItem()
    item['title'] = post.select('.//div[@class="bodytext"]/h2/a/text()').extract()
    item['link'] = post.select('.//div[@class="bodytext"]/h2/a/@href').extract()
    item['content'] = post.select('.//div[@class="bodytext"]/p/text()').extract()
    yield item