Override Scrapy output format 'on the fly' - scrapy

I want to override the spider's output format right in the code. I can't modify the settings and I can't change the command line; I want to do it right in the __init__ method.
Ideally, the new output format should take effect even if something like -o /tmp/1.csv is passed to the spider. But if that's not possible, I can live without it.
How can I do that?
Thank you.

So, you can put a custom attribute in your spider that sets up how the data should be handled for this spider and create a Scrapy item pipeline that honors that configuration.
Your spider code would look like:
from scrapy import Spider

class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.data_destination = self._get_data_destination()

    def _get_data_destination(self):
        # return your dynamically discovered data destination settings here
        pass
And your item pipeline would be something like:
class MySuperDuperPipeline(object):
    def process_item(self, item, spider):
        data_destination = getattr(spider, 'data_destination')
        # code to handle item conforming to data destination here...
        return item
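For example, a minimal sketch of such a pipeline (the shape of data_destination, the open_spider/close_spider file handling and the JSON-lines format are assumptions here, not part of the original answer) could look like this:
import json

class MySuperDuperPipeline(object):
    def open_spider(self, spider):
        # assume data_destination is a dict like {'format': 'jsonlines', 'uri': '/tmp/output.jl'}
        dest = getattr(spider, 'data_destination', None) or {}
        self.file = open(dest.get('uri', 'output.jl'), 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # serialize each item according to the spider's chosen destination
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
Note that this bypasses Scrapy's feed exports entirely, so an -o /tmp/1.csv passed on the command line would still produce its own file alongside this one.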

Related

How can I dynamically generate pytest parametrized fixtures from imported helper methods?

What I want to achieve is basically this, but with a class-scoped, parametrized fixture.
The problem is that if I import the methods (generate_fixture and inject_fixture) from a helper file, the inject_fixture code seems to get called too late. Here is a complete, working code sample:
# all of the code in one file
import pytest
import pytest_check as check

def generate_fixture(params):
    @pytest.fixture(scope='class', params=params)
    def my_fixture(request, session):
        request.cls.param = request.param
        print(params)
    return my_fixture

def inject_fixture(name, someparam):
    globals()[name] = generate_fixture(someparam)

inject_fixture('myFixture', 'cheese')

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        check.equal(True, self.param, 'Sandwhich')
If I move the generate and inject helpers into their own file (without changing them at all), I get a 'fixture not found' error, i.e. if the test file looks like this instead:
import pytest
import pytest_check as check
from .helpers import inject_fixture

inject_fixture('myFixture', 'cheese')

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        check.equal(True, self.param, 'Sandwhich')
Then I get an error at setup: E fixture 'myFixture' not found, followed by a list of available fixtures (which doesn't include the injected fixture).
Could someone help explain why this is happening? Having to define those functions in every single test file sort of defeats the whole point of doing this (keeping things DRY).
I figured out the problem.
Placing the inject_fixture method in a different file changes the global scope of that method. The reason it works inside the same file is that both the caller and the inject_fixture method share the same global scope.
Using the standard inspect package to get the caller's scope solved the issue. Here it is with full boilerplate working code, including class introspection via the built-in request fixture:
import inspect
import pytest

def generate_fixture(scope, params):
    @pytest.fixture(scope=scope, params=params)
    def my_fixture(request):
        request.cls.param = request.param
        print(request.param)
    return my_fixture

def inject_fixture(name, scope, params):
    """Dynamically inject a fixture at runtime"""
    # we need the caller's global scope for this hack to work, hence the use of the inspect module
    caller_globals = inspect.stack()[1][0].f_globals
    # for an explanation of this trick and why it works go here: https://github.com/pytest-dev/pytest/issues/2424
    caller_globals[name] = generate_fixture(scope, params)
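With these helpers living in helpers.py, a test file could then look like this (a sketch; the file name, scope and parameter values are just illustrative):
# test_parking.py (hypothetical file name)
import pytest
import pytest_check as check
from .helpers import inject_fixture

# injects a class-scoped, parametrized fixture into this module's globals
inject_fixture('myFixture', 'class', ['cheese'])

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        # self.param was set by the injected fixture via request.cls.param
        check.equal('cheese', self.param)
Because inject_fixture is now called from the test file itself, the fixture ends up in that file's globals, which is exactly where pytest looks for it.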

Triggering a function after a specific Request finishes in Scrapy

I have a complex scraping application in Scrapy that runs in multiple stages (each stage is a function calling the next stage of scraping and parsing). The spider tries to download multiple targets, and each target consists of a large number of files. What I need to do, after downloading all the files of a target, is call some function that processes them; it cannot process them partially, it needs the whole set of files for the target at the same time. Is there a way to do it?
If you cannot wait until the whole spider is finished, you will have to write some logic in an item pipeline that keeps track of what you have scraped and executes a function at the right moment.
Below is some logic to get you started: it keeps track of the number of items you have scraped per target, and when it reaches 100, it executes the target_complete method. Note that you will have to fill in the field 'target' in the item, of course.
from collections import Counter

class TargetCountPipeline(object):
    def __init__(self):
        self.target_counter = Counter()
        self.target_number = 100

    def process_item(self, item, spider):
        target = item['target']
        self.target_counter[target] += 1
        if self.target_counter[target] >= self.target_number:
            self.target_complete(target)
        return item

    def target_complete(self, target):
        # execute something here when you reached the target
        pass
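If the number of files per target is not a fixed 100, one variation (just a sketch; the expected_files field is hypothetical and you would have to populate it in the spider yourself) is to carry the expected count on each item:
from collections import Counter

class TargetCountPipeline(object):
    def __init__(self):
        self.target_counter = Counter()
        self.expected = {}  # target -> expected number of files

    def process_item(self, item, spider):
        target = item['target']
        # 'expected_files' is a hypothetical field the spider must set on every item
        self.expected[target] = item['expected_files']
        self.target_counter[target] += 1
        if self.target_counter[target] >= self.expected[target]:
            self.target_complete(target)
        return item

    def target_complete(self, target):
        # process the complete set of files for this target here
        pass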

How to access `request_seen()` inside Spider?

I have a Spider, and I have a situation where I want to check whether the request I am going to schedule has already been seen by request_seen() or not.
I don't want to do the check inside a downloader/spider middleware; I just want to check inside my Spider.
Is there any way to call that method?
You should be able to access the dupe filter itself from the spider like this:
self.dupefilter = self.crawler.engine.slot.scheduler.df
then you could use that in other places to check:
req = scrapy.Request('whatever')

if self.dupefilter.request_seen(req):
    # it's already been seen
    pass
else:
    # never saw this one coming
    pass
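Putting the two pieces together, a minimal sketch of a spider doing this (the attribute path comes from the answer above and can change between Scrapy versions; the spider name and URLs are made up) might look like:
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'                      # hypothetical name
    start_urls = ['http://example.com']     # hypothetical URL

    def parse(self, response):
        # the engine (and thus the scheduler's dupe filter) only exists once the crawl
        # is running, so grab it inside a callback rather than in __init__
        dupefilter = self.crawler.engine.slot.scheduler.df
        req = scrapy.Request('http://example.com/some-page')
        if dupefilter.request_seen(req):
            self.logger.info('already seen: %s', req.url)
        else:
            # note: with the default RFPDupeFilter, request_seen() also records the
            # fingerprint, so schedule with dont_filter=True if you still want it crawled
            yield req.replace(dont_filter=True)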
I did something similar with a pipeline. The following is the code that I use.
You specify an identifier and then check whether it has been seen or not:
from scrapy.exceptions import DropItem

class SeenPipeline(object):
    def __init__(self):
        self.isbns_seen = set()

    def process_item(self, item, spider):
        if item['isbn'] in self.isbns_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.isbns_seen.add(item['isbn'])
            return item
Note: you can use the same approach inside your spider, too.
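For instance, the same set-based check could live directly on the spider (a sketch; the spider name, URL and CSS selector are made up for illustration):
import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'                                  # hypothetical
    start_urls = ['http://example.com/books']       # hypothetical

    def __init__(self, *args, **kwargs):
        super(BookSpider, self).__init__(*args, **kwargs)
        self.isbns_seen = set()

    def parse(self, response):
        for isbn in response.css('.isbn::text').extract():
            if isbn in self.isbns_seen:
                continue
            self.isbns_seen.add(isbn)
            yield {'isbn': isbn}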

scrapyd multiple spiders writing items to the same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline:
import json
import os

class JsonWriterPipeline(object):
    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(completeName)

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Once the spiders are running I can see that they are collecting data correctly; items are stored in files XXXX.jl and the spiders work correctly. However, the crawled contents are not reflected in the common file. The spiders seem to work well, but the pipeline is not doing its job and is not collecting data into the common file.
I also noticed that only one spider at a time is writing to the file.
I don't see any good reason to do what you do :) You can change the json_filename setting by passing arguments on your scrapyd schedule.json request. Then you can make each spider generate a slightly different file that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open with 'wb' mode), you're looking at corrupted data.
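For example, each scheduled job can override json_filename for that run, so every spider writes its own file (a sketch using the requests library; the scrapyd URL, project and spider names are made up):
import requests

# schedule two spiders, each with its own json_filename so each run writes a separate file
for spider_name in ['spider_one', 'spider_two']:    # hypothetical spider names
    requests.post('http://localhost:6800/schedule.json', data={
        'project': 'myproject',                      # hypothetical project name
        'spider': spider_name,
        # scrapyd passes 'setting' entries through as per-run Scrapy setting overrides
        'setting': 'json_filename=%s_export.json' % spider_name,
    })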
Edit:
After understanding a bit better what you need: in this case scrapyd is starting multiple crawls running different spiders, where each one crawls a different website, and a consumer process is monitoring a single file continuously.
There are several solutions, including:
- Named pipes: relatively easy to implement and OK for very small Items only (see here)
- RabbitMQ or some other queueing mechanism: a great solution but might be a bit of an overkill
- A database, e.g. an SQLite-based solution: nice and simple but likely requires some coding (a custom consumer)
- An inotifywait-based or other filesystem monitoring solution: nice and likely easy to implement
The last one seems like the most attractive option to me. When the scrapy crawl finishes (spider_closed signal), move, copy or create a soft link for the FEED_URI file to a directory that you monitor with a script like this (a rough polling sketch follows the extension example below). mv and ln are atomic unix operations so you should be fine. Hack the script to append the new file to the tmp file that you feed once to your consumer program.
This way you use the default feed exporters to write your files. The end solution is so simple that you don't need a pipeline; a simple extension should fit the bill.
On an extensions.py in the same directory as settings.py:
from scrapy import signals
from scrapy.exceptions import NotConfigured

class MoveFileOnCloseExtension(object):
    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location
        # os.rename(self.feed_uri, ... destination path...)
        pass
On your settings.py:
EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}
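As for the monitoring script mentioned above, here is a rough sketch (a plain Python polling loop rather than inotifywait; the watched directory and combined file paths are assumptions) that appends newly arrived feed files to the single file your consumer reads:
import glob
import os
import time

WATCH_DIR = '/tmp/finished_feeds'        # directory the extension moves finished feeds into (assumed)
COMBINED_FILE = '/tmp/combined_feed.jl'  # single file the consumer tails (assumed)

seen = set()
while True:
    for path in sorted(glob.glob(os.path.join(WATCH_DIR, '*.jl'))):
        if path in seen:
            continue
        with open(path) as src, open(COMBINED_FILE, 'a') as dst:
            dst.write(src.read())
        seen.add(path)
    time.sleep(5)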

Is Scrapy able to scrape the data once I initialise its object?

Is it possible for Scrapy to work like this: when I call something like scrape.crawl("website") in a class, it goes to the class where the scraping code is and executes it?
I tried to find this in various sources, and most of them asked me to write it in script form, but I couldn't find any working example that shows me how to initialise the object so as to call the script.
I came close with this code but it's not working:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()
Calling the object?
spider = DmozSpider()
Any kind souls with a working example of what I want?
For this you need a quite complex setup, if I understand your question right.
If you have your instance of the spider you need to set up a Crawler and start it afterwards. For example:
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
This is just the base, but you should be able to get started with it. However, as I've said, it is quite complex and you need some additional configuration to get it running.
Update
If you have a URL and want Scrapy to crawl that site you could do it like this:
def __init__(self, url, *args, **kwargs):
    super(DmozSpider, self).__init__(*args, **kwargs)
    self.start_urls = [url]
And then start crawling as described above. Because Scrapy spiders start crawling as soon as you start them, you need the right sequence: set the start URL first, then start the crawl.
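For completeness, a minimal sketch of the same idea using CrawlerProcess, which takes care of the reactor setup for you (the URL is just an example and assumes the __init__ override above):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# kwargs are passed through to DmozSpider.__init__, so the url ends up in start_urls
process.crawl(DmozSpider, url='http://www.dmoz.org/Computers/Programming/Languages/Python/Books/')
process.start()  # blocks until the crawl is finished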