Scrapy Custom Function Cannot Fire scrapy.Requests - scrapy

It seems that yield scrapy.Request cannot be fired from inside a helper function, as in the following code. Could anyone explain why, or help me get it to fire?
Thanks a lot for your help.
class MySpider(CrawlSpider):
    ...
    def start_requests(self):
        yield scrapy.Request(url,
                             callback=self.parse_items)
    ...
    def parse_items(self, response):
        self.__fire_here(response)
        ...

    def __fire_here(self, response):
        # Cannot fire here, why?
        yield scrapy.Request(url,
                             callback=self.parse_items)

To avoid code duplication you can delegate to your __fire_here function in this way (it is a generator, so it has to be iterated with yield from rather than called and discarded, or yielded as a bare generator object):
    def parse_items(self, response):
        yield from self.__fire_here(response)

    def __fire_here(self, response):
        # yield some request here
Also, your code seems to make endless calls from one function to another: parse_items fires requests whose callback is parse_items again. Can you check your logic?
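For reference, here is a minimal, self-contained sketch of that pattern; the URLs and the "next page" follow-up logic are placeholders, not part of the original question:
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            yield scrapy.Request('http://example.com/page/1',
                                 callback=self.parse_items)

        def parse_items(self, response):
            # ... extract and yield items here ...
            # delegate follow-up requests to the helper generator
            yield from self.__fire_here(response)

        def __fire_here(self, response):
            # placeholder: only follow a "next page" link if one exists
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page),
                                     callback=self.parse_items)
Guarding the recursion (here, on the presence of a next-page link) also avoids the endless parse_items -> __fire_here -> parse_items loop mentioned above.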

Related

How to send private data in spider to pipeline?

Say I run scrapy like this each time:
    scrapy crawl [spidername] -a file='filename'
I want to send the filename to the pipeline to specify the item storage location. The location may be different on each run, so it can't be defined in settings.py.
The filename is saved in the spider as a private variable:
    def __init__(self, file):
        self.filename = file
How can I send this parameter to the pipeline?
There are a few ways to do this. One way would be to accept the filename as a parameter in your spider's __init__() method; the pipeline's process_item() method receives the spider on every call, so it can read the value from there.
Another way would be to define the filename as a class variable in your spider, and then access it from your pipeline as an attribute of the spider.
Here is an example of how you could do it using a class variable:
class MySpider(Spider):
    filename = 'filename'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # do stuff
        pass

class MyPipeline(object):
    def process_item(self, item, spider):
        item['filename'] = spider.filename
        return item
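For the -a file=... case from the question, a minimal sketch of wiring the argument through to a pipeline could look like this (the FileArgSpider/StoragePipeline names and the JSON-lines output format are illustrative, not taken from the question):
    import json
    import scrapy

    class FileArgSpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.filename = file  # set from: scrapy crawl myspider -a file='filename'

    class StoragePipeline(object):
        def open_spider(self, spider):
            # read the per-run storage location straight from the spider
            self.fp = open(spider.filename, 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.fp.close()

        def process_item(self, item, spider):
            self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item
The pipeline still has to be enabled in ITEM_PIPELINES as usual.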

Some questions about using multiple pipelines in scrapy

I'm new to scrapy and I started a simple project several days ago. I have successfully implemented items.py, my_spider.py and pipelines.py to scrape some information into a json file. Now I'd like to add some features to my spider, and I've run into some questions.
I have already scraped the desired information from the threads of a forum, including the file_urls and image_urls. I'm a little confused by the tutorial in the Scrapy documentation; here are the related parts of my files:
**settings.py**
...
ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = './Downloads'
IMAGES_STORE = './Downloads'
**items.py**
...
class InfoIterm(scrapy.Item):
    movie_number_title = scrapy.Field()
    movie_pics_links = scrapy.Field()
    magnet_link = scrapy.Field()
    torrent_link = scrapy.Field()
    torrent_name = scrapy.Field()

class TorrentItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
**pipelines.py**
...
def process_item(self, item, spider):
    contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
    with open("./threads_data.json", "wb") as f:
        f.write(contents.encode("utf-8"))
    return item
**my_spider.py**
...
def parse_thread(self, response):
    json_item = InfoIterm()
    json_item['movie_number_title'] = response.xpath("//span[@id='thread_subject']/text()").getall()
    json_item['movie_pics_links'] = response.xpath("//td[@class='t_f']//img/@file").getall()
    json_item['magnet_link'] = response.xpath("//div[@class='blockcode']/div//li/text()").getall()
    json_item['torrent_name'] = response.xpath("//p[@class='attnm']/a/text()").getall()
    json_item['torrent_link'] = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()[0]
    yield json_item

    torrent_link = [self.base_url + link for link in response.xpath("//p[@class='attnm']/a/@href").getall()]
    yield {'file_urls': torrent_link}

    movie_pics_links = response.xpath("//td[@class='t_f']//img/@file").getall()
    yield {'image_urls': movie_pics_links}
Now I can download the images successfully, but the files are not downloaded. My json file is also overwritten by the last image_urls item.
So, here are my questions:
Can one spider use multiple pipelines? If so, what's the best way to use them (for example, in my case; an example would be great!)?
In some cases some of these json_item['xxx'] fields are not present on certain threads, and the console prints some information reporting the problem. I tried wrapping each of these lines in try-except, but that becomes really ugly and I believe there should be a better way to do it. What is the best way to handle that?
Thanks a lot.
1- Yes, you can use several pipelines; you just need to mind the order in which they are called. (More on that here)
If they are meant to process different Item objects, all you need to do is check the class of the item received in the process_item method: process the ones you want, and return the others untouched (see the sketch below).
2- What is the error? I can't help much without that information. Please post an execution log.
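A minimal sketch of point 1, reusing the item classes from the question. The priorities below are illustrative (they only need to be distinct so the order is well defined), and the spider would have to yield TorrentItem/ImageItem instances instead of plain dicts for the isinstance checks to line up with items.py:
    # settings.py
    ITEM_PIPELINES = {
        'my_project.pipelines.InfoPipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 310,
        'scrapy.pipelines.files.FilesPipeline': 320,
    }

    # pipelines.py
    import json
    from my_project.items import InfoIterm

    class InfoPipeline(object):
        def open_spider(self, spider):
            self.fp = open('./threads_data.json', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.fp.close()

        def process_item(self, item, spider):
            # only handle the "info" items; let file/image items pass through untouched
            if isinstance(item, InfoIterm):
                self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item
Opening the output file once per run and appending one JSON line per item also avoids the file being overwritten by the last item, which was the other problem mentioned.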

Scrapy Request's meta arg is a shallow copy, but in scrapy_redis the Request's meta arg is a deep copy. Why?

scrapy:
import scrapy
from scrapy.http import Request

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['mzitu.com']
    start_urls = ['http://www.mzitu.com/']

    def start_requests(self):
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 2769118042568 first
L-id: 2769118042568 second
They're equal, so this behaves like a shallow copy: the same list object comes back.
scrapy_redis:
from scrapy_redis.spiders import RedisSpider
from scrapy.http import Request

class MzituSpider(RedisSpider):  # scrapy_redis
    name = 'mzitu'
    redis_key = 'a:a'  # this is discarded,

    def start_requests(self):  # because this overrides RedisSpider's method
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 1338852857992 first
L-id: 1338852858312 second
They're not equal, so this behaves like a deep copy.
Question:
I want to know why, and how I can solve it, i.e. make scrapy_redis behave like the shallow copy.
This has to do with the fact that scrapy-redis uses its own scheduler class, which serializes/deserializes all requests through redis before pushing them on to the downloader (it keeps the queue in redis). There is no "easy" way around this, as it is basically the core of scrapy-redis' functionality. My advice is to not put too much runtime-sensitive stuff into meta, as that is generally not the best idea in scrapy anyway.
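If the goal is just to share a mutable object between callbacks, a sketch of the workaround hinted at above is to keep the object on the spider itself and pass only a small serializable token through meta. Note this only helps within a single spider process; it cannot work once callbacks run in different scrapy-redis workers:
    import scrapy

    class MzituSpider(scrapy.Spider):
        name = 'mzitu'

        def start_requests(self):
            self.shared = {'L': []}  # lives on the spider, never serialized to redis
            print("L-id:", id(self.shared['L']), "first")
            yield scrapy.Request(url="http://www.mzitu.com/5675",
                                 callback=self.parse,
                                 meta={"key": "L"},  # only a small, serializable token
                                 dont_filter=True)

        def parse(self, response):
            L = self.shared[response.meta["key"]]  # same object, same id()
            print("L-id:", id(L), "second")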

Scrapy, make http request in pipeline

Assume I have a scraped item that looks like this:
{
    name: "Foo",
    country: "US",
    url: "http://..."
}
In a pipeline I want to make a GET request to the url and check some headers like content_type and status. When the headers do not meet certain conditions I want to drop the item. Like
class MyPipeline(object):
    def process_item(self, item, spider):
        request(item['url'], function(response) {
            if (...) {
                raise DropItem()
            }
            return item
        }, function(error) {
            raise DropItem()
        })
Smells like this is not possible using pipelines. What do you think? Any ideas how to achieve this?
The spider:
import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            yield station
Easy way
import requests
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    response = requests.get(item['url'])
    if response.status_code ...:
        raise DropItem()
    elif response.text ...:
        raise DropItem()
    else:
        return item
Scrapy way
Now, I think you shouldn't do this inside a pipeline; you should handle it inside the spider, yielding a request instead of the item and then yielding the item from that request's callback.
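A sketch of that spider-side approach for the StationSpider above; the HEAD request, the cb_kwargs argument (which needs Scrapy 1.7+), and the content-type check are illustrative assumptions:
    import json
    import scrapy

    class StationSpider(scrapy.Spider):
        name = 'station'
        start_urls = ['http://...']

        def parse(self, response):
            for station in json.loads(response.text):
                # instead of yielding the item right away, check its url first
                yield scrapy.Request(
                    station['url'],
                    method='HEAD',  # the headers are enough for the check
                    callback=self.check_station,
                    cb_kwargs={'station': station},
                )

        def check_station(self, response, station):
            content_type = response.headers.get('Content-Type', b'').decode()
            if 'text/html' in content_type:
                yield station  # only emit the item if the check passes
            # otherwise simply yield nothing, which drops it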
Now if you still want to include a scrapy Request inside a pipeline you could do something like this:
class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        ...
        self.crawler.engine.crawl(
            Request(
                url='someurl',
                callback=self.custom_callback,
            ),
            spider,
        )
        # you have to drop the item, and send it again after your check
        raise DropItem()

    # YES, you can define a method callback inside the same pipeline
    def custom_callback(self, response):
        ...
        yield item
Note that we are emulating the same behaviour as spider callbacks inside the pipeline. You need to figure out a way to always drop the items when you want to do an extra request, and only pass through the ones that have already been checked by the extra callback.
One way could be sending different types of items, and check them inside the process_item of the pipeline:
def process_item(self, item, spider):
    if isinstance(item, TempItem):
        ...
    elif isinstance(item, FinalItem):
        return item
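Putting the two snippets together, a hedged sketch could look like the following. TempItem/FinalItem and the header check are illustrative, and note that engine.crawl() took a spider argument in older Scrapy versions but only the request in recent ones:
    import scrapy
    from scrapy import Request
    from scrapy.exceptions import DropItem

    class TempItem(scrapy.Item):
        name = scrapy.Field()
        country = scrapy.Field()
        url = scrapy.Field()

    class FinalItem(TempItem):
        pass

    class MyPipeline(object):
        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_item(self, item, spider):
            if isinstance(item, FinalItem):
                return item  # already checked, let it through
            if not isinstance(item, TempItem):
                return item  # not ours, pass it along untouched
            # schedule the extra request and drop the temporary item for now
            self.crawler.engine.crawl(
                Request(
                    item['url'],
                    callback=self.check_headers,
                    cb_kwargs={'fields': dict(item)},
                ),
                spider,  # drop this argument on recent Scrapy versions
            )
            raise DropItem('waiting for the header check')

        def check_headers(self, response, fields):
            if b'text/html' in response.headers.get('Content-Type', b''):
                yield FinalItem(fields)  # re-enters the pipeline as a FinalItem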

Custom scrapy xml+rss exporter to S3

I'm trying to create a custom xml feed, that will contain the spider scraped items, as well as some other high level information, stored in the spider definition. The output should be stored on S3.
The desired output looks like the following:
<xml>
    <title>my title defined in the spider</title>
    <description>The description from the spider</description>
    <items>
        <item>...</item>
    </items>
</xml>
In order to do so, I defined a custom exporter, which is able to export the desired output file locally.
spider.py:
class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/']
    title = 'The DMOZ super feed'

    def parse(self, response):
        ...
        yield item
exporters.py:
from scrapy.conf import settings
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):
    def __init__(self, *args, **kwargs):
        self.title = kwargs.pop('title', 'no title found')
        self.link = settings.get('FEED_URI', 'localhost')
        super(CustomItemExporter, self).__init__(*args, **kwargs)

    def start_exporting(self):
        ...
        self._export_xml_field('title', self.title)
        ...
settings.py:
FEED_URI = 's3://bucket-name/%(name)s.xml'
FEED_EXPORTERS = {
    'custom': 'my.exporters.CustomItemExporter',
}
I'm able to run the whole thing and get the output on s3 by running the following command:
scrapy crawl dmoz -t custom
or, if I want to export a json locally instead: scrapy crawl -o dmoz.json dmoz
But at this point, I'm unable to retrieve the spider title to put it in the output file.
I tried implementing a custom pipeline, which outputs data locally (following numerous examples):
pipelines.py:
from scrapy import signals
from my.exporters import CustomItemExporter

class CustomExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_feed.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CustomItemExporter(
            file,
            title=spider.title,
        )
        self.exporter.start_exporting()
The problem is, the file is stored locally, and this short circuits the FeedExporter logic defined in feedexport.py, that handles all the different storages.
No info from the FeedExporter is available in the pipeline, and I would like to reuse all that logic without duplicating code. Am I missing something? Thanks for any help.
Here's my solution:
get rid of the pipeline.
Override scrapy's FeedExporter
myproject/feedexport.py:
from scrapy.extensions.feedexport import FeedExporter as _FeedExporter
from scrapy.extensions.feedexport import SpiderSlot

class FeedExporter(_FeedExporter):
    def open_spider(self, spider):
        uri = self.urifmt % self._get_uri_params(spider)
        storage = self._get_storage(uri)
        file = storage.open(spider)
        extra = {
            # my extra settings
        }
        exporter = self._get_exporter(file, fields_to_export=self.export_fields, extra=extra)
        exporter.start_exporting()
        self.slot = SpiderSlot(file, exporter, storage, uri)
All I wanted to do was basically pass those extra settings to the exporter, but the way it's built, there is no choice but to override open_spider.
To support other scrapy export formats at the same time, I would have to consider setting the dont_fail flag to True in some of scrapy's built-in exporters to prevent them from failing on the extra keyword argument.
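On the exporter side, a hedged sketch of what consuming that extra dict could look like. The title/description keys are assumptions matching the desired output above; this assumes _get_exporter forwards its keyword arguments to the exporter's constructor (as it did in the Scrapy versions this answer targets), and it writes through the underlying XMLGenerator rather than _export_xml_field, whose signature differs between Scrapy versions:
    # my/exporters.py (sketch)
    from scrapy.exporters import XmlItemExporter

    class CustomItemExporter(XmlItemExporter):
        def __init__(self, file, extra=None, **kwargs):
            # 'extra' arrives from the overridden FeedExporter.open_spider above
            self.extra = extra or {}
            super(CustomItemExporter, self).__init__(file, **kwargs)

        def start_exporting(self):
            super(CustomItemExporter, self).start_exporting()
            # write the high-level fields right after the root element is opened
            for field in ('title', 'description'):
                self.xg.startElement(field, {})
                self.xg.characters(self.extra.get(field, ''))
                self.xg.endElement(field)
Exactly where these fields end up relative to the items wrapper depends on how start_exporting builds the document, as in the original exporter above.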
Replace scrapy's feed exporter with the new one in myproject/settings.py:
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,
    'myproject.feedexport.FeedExporter': 0,
}
... otherwise the two feed exporters would run at the same time.