Scraping a list of urls after log in - scrapy

The site to scrape has multiple projects with multiple pages and requires a log in. I tried:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            return scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    ... determine the url of the next page ...
    return scrapy.Request(... next page ..., self.parse)
This results in scraping all pages of one project (login is successful), but then it stops.
If return scrapy.Request() in the logged_in() function is replaced by yield scrapy.Request(), then it reads the first page of all projects.
I played around with the returns and yields, but I can't get it to scrape all pages of all projects.
BTW, I tried to create a start_urls list, but that doesn't work because the spider first needs to log into the site.

return will only return once, so don't expect more than one Request to be returned there; if you wish to return more than one, use yield.
Scrapy's Request has a parameter called dont_filter; the duplicates filter is probably filtering some of your calls to the parse function:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
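Putting those two points together, here is a minimal sketch (the file name is hypothetical) of yielding one Request per project URL, with dont_filter=True in case some of those URLs would otherwise be dropped by the duplicates filter:
def logged_in(self, response):
    # yield one request per project URL read from the file;
    # dont_filter=True bypasses the duplicates filter (use with care)
    with open('projects.txt') as f:  # hypothetical file name
        for url in f.readlines():
            yield scrapy.Request(url.strip(), callback=self.parse, dont_filter=True)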

@Guy Gavriely makes a good point about return. I would add that you need another parse method to filter all the pages you want:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    yield scrapy.Request(... next page ..., self.parse_user)

def parse_user(self, response):
    ... do some scraping ....
    yield items
You're not done yet! The final parse method may need to be iterated for example:
def parse(self, response):
    ... do some scraping ...
    for sel in some_list_of_users_by_an_xpath:
        user_profile_tag = sel.xpath('xpath_to_user_profile_urls').get()
        # might need cleaning prior to the Request,
        # e.g. 'domain' + user_profile_tag, or .split()/.replace(), etc.
        user_profile_url_clean = user_profile_tag
        yield scrapy.Request(user_profile_url_clean, self.parse_user)
In this case the parse function will yield a request for each user in that list of users. Then parse_user will do most of the actual digging and scraping. Once it's done, it will return to the original parse method and just go on to the next one in the list.
Good luck!

Related

Spider through Scrapyrt returns 0 items when scraping search page

I created a spider that can scrape a page in an e-commerce site and gather the data on the different items.
The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).
But, when I run it through Scrapyrt, the specific page works fine, but the search page returns 0 items. No errors, just 0 items. This occurs on 2 different sites with 2 different spiders.
Is there something specific to search pages that has to be taken in account when using Scrapyrt?
Take a spider like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print("Found ", len(response.css("article")), " items")
        for article in response.css("article"):
            print("Item: ", article.css("img::attr(title)").get())
(I also set ROBOTSTXT_OBEY = False.)
I get 20 items back, if I run:
scrapy crawl minimal
But I get 0 items back (no errors, just no results), if I run:
curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"
I think you could put "return item" at the end of "process_item" in your pipelines.py:
def process_item(self, item, spider):
    ...
    return item
I found the same issue as you described; after I edited my code, I solved it.
Hope this could help =)
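For reference, a minimal sketch of what this answer describes (the class, module, and priority are placeholders, not from the original post): a pipeline whose process_item returns the item, plus the ITEM_PIPELINES setting that enables it.
# pipelines.py
class MyProjectPipeline:
    def process_item(self, item, spider):
        # ... transform or validate the item ...
        return item  # returning the item lets later pipeline stages and exporters receive it

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyProjectPipeline': 300,
}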

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example:
www.example.com/product/123 was found on www.example.com/page/2.
When scrapy scrapes information from /product/123 I want to have a field that is "Scraped From" and returns /page/2. For every URL that is scraped, I'd want to find the originating page where the URL was found. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!
The easiest way is to use response.headers. There should be a Referer header.
referer = response.headers['Referer']
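One small detail worth noting (an addition, not part of the original answer): Scrapy header values are bytes, so you may want to decode the referer and guard against the header being absent:
# headers values are bytes; fall back to an empty string if the header is missing
referer = response.headers.get('Referer', b'').decode('utf-8')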
You can also use meta to pass information along to the next URL.
def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(product_url, callback=self.parse_product, meta={'referer': response.url})

def parse_product(self, response):
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item
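As an aside (assuming Scrapy 1.7 or later, which added cb_kwargs), the same information can be passed as a callback keyword argument instead of via meta; a minimal sketch reusing the names from the snippet above:
def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(
        product_url,
        callback=self.parse_product,
        cb_kwargs={'referer': response.url},  # delivered to the callback as a keyword argument
    )

def parse_product(self, response, referer):
    item = ItemName()
    item['referer'] = referer
    yield item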

Collect items from multiple requests in an array in Scrapy

I wrote a small example spider to illustrate my problem:
import json
import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        for i in json.loads(response.text)['ids']:
            yield scrapy.Request(f'https://example.com/list/{i}', callback=self.parse_lists)

    def parse_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)
I need to have all the items that result from multiple requests (all ListEntryItems) in an array inside the spider, so I can dispatch requests that depend on all items.
My first idea was to chain the requests and pass the remaining IDs and the already extracted items in the request's meta attribute until the last request is reached.
import json
from typing import List

import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        ids = json.loads(response.text)['ids']
        yield self._create_request(ids, [])

    def parse_lists(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))
        yield self._create_request(response.meta['ids'], items)

    def finish(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))

    def _extract_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)

    def _create_request(self, ids: list, items: List[ListEntryItem]):
        i = ids.pop(0)
        return scrapy.Request(
            f'https://example.com/list/{i}',
            meta={'ids': ids, 'items': items},
            callback=self.parse_lists if len(ids) > 1 else self.finish
        )
As you can see, my solution looks very complex. I'm looking for something more readable and less complex.
There are different approaches for this. One is chaining, as you do. Problems occur if one of the requests in the middle of the chain is dropped for any reason. You have to be really careful about that and handle all possible errors / ignored requests.
Another approach is to use a separate spider for all "grouped" requests.
You can start those spiders programmatically and pass a bucket (e.g. a dict) as spider attribute. Within your pipeline you add your items from each request to this bucket. From "outside" you listen to the spider_closed signal and get this bucket which then contains all your items.
Look here for how to start a spider programmatically via a crawler runner:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Pass a bucket to your spider when calling crawl() on your crawler runner:
crawler_runner_object.crawl(YourSpider, bucket=dict())
and catch the spider_closed signal:
from scrapy import signals
from scrapy.signalmanager import dispatcher

def on_spider_closed(spider):
    bucket = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)
This approach might seem even more complicated than chaining your requests, but it actually takes a lot of complexity out of the problem: within your spider you can make your requests without having to care much about all the other requests.
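To make the bucket approach concrete, here is a minimal sketch under the following assumptions: the spider is the question's ListEntrySpider, it is importable from the runner script, and a pipeline that appends items to the bucket is enabled via ITEM_PIPELINES; all names other than the Scrapy/Twisted APIs are placeholders.
# run_spider.py
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
from scrapy.utils.project import get_project_settings

def on_spider_closed(spider):
    # the bucket passed to crawl() is available as a spider attribute,
    # filled by the pipeline while the spider was running
    print(f"collected {len(spider.bucket['items'])} items")

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)

runner = CrawlerRunner(get_project_settings())
deferred = runner.crawl(ListEntrySpider, bucket={'items': []})
deferred.addBoth(lambda _: reactor.stop())
reactor.run()

# pipelines.py (must be enabled in ITEM_PIPELINES)
class BucketPipeline:
    def process_item(self, item, spider):
        spider.bucket['items'].append(item)  # collect every item in the shared bucket
        return item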

Scrapy stopping conditions

Illustrative scenario: A Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu is found for each restaurant, it is no longer necessary to continue crawling that particular restaurant website. The spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think that a CloseSpider exception is appropriate since I don't want to stop the entire spider, just the queue of the current start_url, and then move on to the next start_url.
Don't use scrapy rules.
All you need is:
start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item
And don't forget to add all domains to the allowed_domains list.
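If deeper crawling per restaurant is still needed, one possible pattern (entirely a sketch; the spider name, URLs, and the CSS selector used as the stopping condition are hypothetical, not from the original answer) is to tag every request with its originating start_url and stop following links for that site once its menu has been found:
import scrapy

class MenuSpider(scrapy.Spider):
    name = "menus"
    start_urls = ['http://restaurant-a.example', 'http://restaurant-b.example']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.done = set()  # start_urls whose menu has already been found

    def parse(self, response):
        origin = response.meta.get('origin', response.url)
        if origin in self.done:
            return  # already-queued pages for this site are ignored on arrival
        if response.css('.menu'):  # hypothetical stopping condition
            self.done.add(origin)
            yield {'menu_url': response.url, 'site': origin}
            return
        # keep exploring this site until its menu is found
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse, meta={'origin': origin})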

How to include the start url in the "allow" rule in SgmlLinkExtractor using a scrapy crawl spider

I have searched a lot of topics but cannot seem to find the answer to my specific question.
I have created a crawl spider for a website and it works perfectly. I then made a similar one to crawl a similar website, but this time I have a small issue. Down to business:
My start url looks as follows: www.example.com. The links on that page that I want to apply my spider to look like:
www.example.com/locationA
www.example.com/locationB
www.example.com/locationC
...
I now have an issue:
Every time I enter the start url, it redirects to www.example.com/locationA automatically, and the links my spider ends up working on include
www.example.com/locationB
www.example.com/locationC
...
So my problem is how I can include www.example.com/locationA in the returned URLs. I even got log info like:
2011-11-28 21:25:33+1300 [example.com] DEBUG: Redirecting (302) to <...> from <GET http://www.example.com/>
2011-11-28 21:25:34+1300 [example.com] DEBUG: Redirecting (302) to <...> (referer: None)
2011-11-28 21:25:37+1300 [example.com] DEBUG: Redirecting (302) to <...> (referer: www.example.com/locationB)
Print out from parse_item: www.example.com/locationB
....
I think the issue might be related to that (referer: None) somehow. Could anyone please shed some light on this?
I have narrowed down this issue by changing the start url to www.example.com/locationB. Since all the pages contain the lists of all locations, this time I got my spider working on:
-www.example.com/locationA
-www.example.com/locationC
...
In a nutshell, I am looking for a way to include the url which is the same as (or is redirected to from) the start url in the list that the parse_item callback will work on.
For others who have the same problem: after a lot of searching, all you need to do is name your callback function parse_start_url.
Eg:
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=(
        '//*[contains(concat( " ", @class, " " ), concat( " ", "pagination-next", " " ))]//a',)),
        callback="parse_start_url", follow=True),
)
Adding sample code based on mindcast's suggestion:
I managed using the following approach:
class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/A']
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@id='tag_cloud']/a",)), callback="parse_items", follow=True),)

    def parse_start_url(self, response):
        self.log('>>>>>>>> PARSE START URL: %s' % response)
        # www.example.com/A will be parsed here
        return self.parse_items(response)

    def parse_items(self, response):
        """Scrape data from links based on Crawl Rules"""
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)
At first I thought that there is a simple solution using start_requests() like:
def start_requests(self):
    yield Request('START_URL_HERE', callback=self.parse_item)
But tests showed that when start_requests() is used instead of a start_urls list, the spider ignores the rules, because CrawlSpider.parse(response) is not called.
So, here is my solution:
import itertools

class SomeSpider(CrawlSpider):
    ....
    start_urls = ('YOUR_START_URL',)
    rules = (
        Rule(
            SgmlLinkExtractor(allow=(r'YOUR_REGEXP',)),
            follow=True,
            callback='parse_item'),
    )

    def parse(self, response):
        return itertools.chain(
            CrawlSpider.parse(self, response),
            self.parse_item(response))

    def parse_item(self, response):
        yield item
Perhaps there is a better way.
A simple workaround is to specifically add a rule for the start_url (in your case: http://example.com/locationA):
class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/locationA']
    rules = (
        Rule(LinkExtractor(allow=('locationA')), callback='parse_item'),
        Rule(LinkExtractor(allow=('location\.*?'), restrict_css=('.pagination-next',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)