How can I override a previously defined pyinvoke task, but "super()" call the older implementation?

I have a standard_tasks.py module that provides a task for build:

@task
def build(ctx):
    do_this()
From my tasks.py I am currently doing:
from standard_tasks import *

_super_build = build.body

@task
def _build(ctx):
    # Here I want to call the older implementation that triggered do_this()
    _super_build(ctx)
    do_that()

build.body = _build
However, this feels clunky. I was wondering if there's a better way?
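For comparison, here is a hedged sketch of the same idea that avoids reassigning build.body: capture the original implementation via .body (as the snippet above already does) and register the replacement under the original name with @task, relying on the local definition shadowing the imported one. The names standard_tasks and do_that are taken from the question; this is an illustrative sketch, not a confirmed invoke recipe.

from invoke import task
from standard_tasks import *   # brings in the standard tasks, including build

_super_build = build.body      # keep a reference to the original implementation

@task
def build(ctx):                # redefining build here shadows the imported task
    _super_build(ctx)          # "super()" call into the standard build
    do_that()                  # placeholder for the extra step from the question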

Related

Scrapy concurrent spiders instance variables

I have a number of Scrapy spiders running and recently had a strange bug. I have a base class and a number of subclasses:
class MyBaseSpider(scrapy.Spider):
    new_items = []

    def spider_closed(self):
        # Email any new items that weren't in the last run

class MySpiderImpl1(MyBaseSpider):
    def parse(self):
        # Implement site specific checks
        self.new_items.append(new_found_item)

class MySpiderImpl2(MyBaseSpider):
    def parse(self):
        # Implement site specific checks
        self.new_items.append(new_found_item)
This seems to have been running well, new items get emailed to me on a per-site basis. However I've recently had some emails from MySpiderImpl1 which contain items from Site 2.
I'm following the documentation to run from a script:
scraper_settings = get_project_settings()
runner = CrawlerRunner(scraper_settings)
configure_logging()
sites = get_spider_names()
for site in sites:
    runner.crawl(site.spider_name)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
I suspect the solution here is to switch to a pipeline which collates the items for a site and emails them out when pipeline.close_spider is called but I was surprised to see the new_items variable leaking between spiders.
Is there any documentation on concurrent runs? Is it bad practice to keep variables on a base class? I do also track other pieces of information on the spiders in variables such as the run number - should this be tracked elsewhere?
In Python, class variables are shared between all instances and subclasses. So your MyBaseSpider.new_items is the exact same list that MySpiderImpl1.new_items and MySpiderImpl2.new_items refer to.
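A quick illustration of that sharing, independent of Scrapy (the class names here are made up for the example):

class Base:
    items = []              # class attribute: one list object for the whole hierarchy

class A(Base):
    pass

class B(Base):
    pass

A().items.append("x")
print(B().items)            # ['x'] -- both subclasses see the very same list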
As you suggested, you could implement a pipeline, although this might require significant refactoring of your current code. It could look something like this.
pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        if spider.name == 'site1':
            ...  # email item
        elif spider.name == 'site2':
            ...  # do something different
I am assuming all of your spiders have names... I think it's a requirement.
Another option that probably requires less effort might be to override the start_requests method in your base class to assign a unique list at the start of the crawling process, as sketched below.
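A minimal sketch of that second option, assuming the MyBaseSpider from the question and Scrapy's default start_requests behaviour:

import scrapy

class MyBaseSpider(scrapy.Spider):
    def start_requests(self):
        # give each spider instance its own list instead of the shared class attribute
        self.new_items = []
        return super().start_requests()  # keep the default requests built from start_urls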

How can I dynamically generate pytest parametrized fixtures from imported helper methods?

What I want to achieve is basically this, but with a class-scoped, parametrized fixture.
The problem is that if I import the methods (generate_fixture and inject_fixture) from a helper file, the fixture injection code seems to get called too late. Here is a complete, working code sample:
# all of the code in one file
import pytest
import pytest_check as check

def generate_fixture(params):
    @pytest.fixture(scope='class', params=params)
    def my_fixture(request, session):
        request.cls.param = request.param
        print(params)
    return my_fixture

def inject_fixture(name, someparam):
    globals()[name] = generate_fixture(someparam)

inject_fixture('myFixture', 'cheese')

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        check.equal(True, self.param, 'Sandwhich')
If I move the generate and inject helpers into their own file (without changing them at all) I get a fixture not found error, i.e. if the test file looks like this instead:
import pytest
import pytest_check as check
from .helpers import inject_fixture

inject_fixture('myFixture', 'cheese')

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        check.equal(True, self.param, 'Sandwhich')
Then I get an error at setup: E fixture 'myFixture' not found, followed by a list of available fixtures (which doesn't include the injected fixture).
Could someone help explain why this is happening? Having to define those functions in every single test file sort of defeats the whole point of doing this (keeping things DRY).
I figured out the problem.
Placing the inject_fixture function in a different file changes the global scope it writes to. The reason it works inside the same file is that the caller and inject_fixture share the same module globals.
Using the standard library's inspect module to get the caller's globals solved the issue. Here it is as full, working boilerplate code, including class introspection via the built-in request fixture:
import inspect

import pytest

def generate_fixture(scope, params):
    @pytest.fixture(scope=scope, params=params)
    def my_fixture(request):
        request.cls.param = request.param
        print(request.param)
    return my_fixture

def inject_fixture(name, scope, params):
    """Dynamically inject a fixture at runtime"""
    # we need the caller's global scope for this hack to work, hence the use of the inspect module
    # for an explanation of this trick and why it works go here: https://github.com/pytest-dev/pytest/issues/2424
    caller_globals = inspect.stack()[1][0].f_globals
    caller_globals[name] = generate_fixture(scope, params)
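A hypothetical usage sketch, assuming the two helpers above live in a helpers.py next to the test module (the fixture name and params below are illustrative):

import pytest
import pytest_check as check
from .helpers import inject_fixture

inject_fixture('myFixture', 'class', ['cheese'])  # injected into this module's globals

@pytest.mark.usefixtures('myFixture')
class TestParkingInRadius:
    def test_custom_fixture(self):
        check.equal('cheese', self.param)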

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time. I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline:
class JsonWriterPipeline(object):

    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(completeName)

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
After the spiders are running I can see that they are collecting data correctly: items are stored in the per-spider files XXXX.jl and the spiders work correctly. However, the crawled contents are not reflected in the common file; the pipeline does not seem to be doing its job of collecting data into the common file.
I also noticed that only one spider is writing to the file at a time.
I don't see any good reason to do what you do :) You can change the json_filename setting by passing arguments on your scrapyd schedule.json request. Then you can make each spider generate slightly different files that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open with 'wb' mode) you're asking for corrupted data.
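For the first suggestion, a hedged sketch of how the scheduling could look, assuming scrapyd's default port and made-up project/spider names; the setting argument of schedule.json is used here to give each job its own feed file:

import requests

for spider_name in ('site1', 'site2'):
    requests.post(
        'http://localhost:6800/schedule.json',
        data={
            'project': 'myproject',   # hypothetical project name
            'spider': spider_name,
            # one feed file per job; merge them with post-processing or at query time
            'setting': 'FEED_URI=/tmp/%s_items.jl' % spider_name,
        },
    )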
Edit:
After understanding a bit better what you need: in this case, scrapyd starts multiple crawls running different spiders, where each one crawls a different website, and a consumer process monitors a single file continuously.
There are several solutions, including:

named pipes: relatively easy to implement and OK for very small Items only (see here)
RabbitMQ or some other queueing mechanism: a great solution, but might be a bit of an overkill
a database, e.g. an SQLite-based solution: nice and simple, but likely requires some coding (a custom consumer)
an inotifywait-based or other filesystem monitoring solution: nice and likely easy to implement
The last one seems like the most attractive option to me. When a scrapy crawl finishes (spider_closed signal), move, copy or soft-link the FEED_URI file to a directory that you monitor with a script like this. mv and ln are atomic unix operations, so you should be fine. Hack the script to append the new file to the tmp file that you feed once to your consumer program.
By using this approach, you use the default feed exporters to write your files. The end solution is so simple that you don't need a pipeline; a simple Extension should fit the bill.
On an extensions.py in the same directory as settings.py:
from scrapy import signals
from scrapy.exceptions import NotConfigured

class MoveFileOnCloseExtension(object):

    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location
        # os.rename(self.feed_uri, ... destination path...)
        pass
On your settings.py:
EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}

Flask + SQLAlchemy + pytest - not rolling back my session

There are several similar questions on stack overflow, and I apologize in advance if I'm breaking etiquette by asking another one, but I just cannot seem to come up with the proper set of incantations to make this work.
I'm trying to use Flask + Flask-SQLAlchemy and then use pytest to manage the session such that when the function-scoped pytest fixture is torn down, the current transaction is rolled back.
Some of the other questions seem to advocate using the db "drop all and create all" pytest fixture at the function scope, but I'm trying to use the joined session, and use rollbacks, since I have a LOT of tests. This would speed it up considerably.
http://alexmic.net/flask-sqlalchemy-pytest/ is where I found the original idea, and Isolating py.test DB sessions in Flask-SQLAlchemy is one of the questions recommending using function-level db re-creation.
I had also seen https://github.com/mitsuhiko/flask-sqlalchemy/pull/249 , but that appears to have been released with flask-sqlalchemy 2.1 (which I am using).
My current (very small, hopefully immediately understandable) repo is here:
https://github.com/hoopes/flask-pytest-example
There are two print statements - the first (in example/__init__.py) should have an Account object, and the second (in test/conftest.py) is where I expect the db to be cleared out after the transaction is rolled back.
If you pip install -r requirements.txt and run py.test -s from the test directory, you should see the two print statements.
I'm about at the end of my rope here - there must be something I'm missing, but for the life of me, I just can't seem to find it.
Help me, SO, you're my only hope!
You might want to give pytest-flask-sqlalchemy-transactions a try. It's a plugin that exposes a db_session fixture that accomplishes what you're looking for: it allows you to run database updates that will get rolled back when the test exits. The plugin is based on Alex Michael's blog post, with some additional support for nested transactions that covers a wider array of use cases. There are also some configuration options for mocking out connectables in your app so you can run arbitrary methods from your codebase, too.
For test_accounts.py, you could do something like this:
from example import db, Account

class TestAccounts(object):

    def test_update_view(self, db_session):
        test_acct = Account(username='abc')
        db_session.add(test_acct)
        db_session.commit()

        resp = self.client.post('/update',
                                data={'a': 1},
                                content_type='application/json')
        assert resp.status_code == 200
The plugin needs access to your database through a _db fixture, but since you already have a db fixture defined in conftest.py, you can set up database access easily:
@pytest.fixture(scope='session')
def _db(db):
    return db
You can find details on setup and installation in the docs. Hope this helps!
I'm also having issues with the rollback; my code can be found here.
After reading some documentation, it seems the begin() function should be called on the session.
So in your case I would update the session fixture to this:
@pytest.yield_fixture(scope='function', autouse=True)
def session(db, request):
    """Creates a new database session for a test."""
    db.session.begin()
    yield db.session
    db.session.rollback()
    db.session.remove()
I didn't test this code, but when I try it on my code I get the following error:
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "./venv/lib/python2.7/site-packages/_pytest/main.py", line 90, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
...
INTERNALERROR>   File "./venv/lib/python2.7/site-packages/_pytest/python.py", line 59, in filter_traceback
INTERNALERROR>     return entry.path != cutdir1 and not entry.path.relto(cutdir2)
INTERNALERROR> AttributeError: 'str' object has no attribute 'relto'
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from unittest import TestCase

# global application scope: create Session class and engine
Session = sessionmaker()
engine = create_engine('postgresql://...')

class SomeTest(TestCase):

    def setUp(self):
        # connect to the database
        self.connection = engine.connect()

        # begin a non-ORM transaction
        self.trans = self.connection.begin()

        # bind an individual Session to the connection
        self.session = Session(bind=self.connection)

    def test_something(self):
        # use the session in tests.
        self.session.add(Foo())
        self.session.commit()

    def tearDown(self):
        self.session.close()

        # rollback - everything that happened with the
        # Session above (including calls to commit())
        # is rolled back.
        self.trans.rollback()

        # return connection to the Engine
        self.connection.close()
The SQLAlchemy docs have a solution for this case; the recipe above is taken from there.
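For reference, a hedged sketch of how that recipe might be adapted to a pytest fixture with Flask-SQLAlchemy, assuming a session-scoped db fixture like the one in the question's repo (this mirrors the pattern from the Alex Michael post linked in the question):

import pytest

@pytest.fixture(scope='function')
def session(db):
    connection = db.engine.connect()
    trans = connection.begin()      # outer, non-ORM transaction

    # bind a scoped session to this connection and make the app use it
    session = db.create_scoped_session(options={'bind': connection, 'binds': {}})
    db.session = session

    yield session

    session.remove()
    trans.rollback()                # undo everything, including commits
    connection.close()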

Persistent connection in twisted

I'm new to Twisted and have one question: how can I organize a persistent connection in Twisted? I have a queue and check it every second. If it has something, I send it to the client. I can't find anything better than calling dataReceived every second.
Here is the code of my Protocol implementation:
class SyncProtocol(protocol.Protocol):
    # ... some code here

    def dataReceived(self, data):
        if self.orders_queue.has_new_orders():
            for order in self.orders_queue:
                self.transport.write(str(order))
        reactor.callLater(1, self.dataReceived, data)  # 1 second delay
It works the way I need, but I'm sure it is a very bad solution. How can I do this differently (flexibly and correctly)? Thanks.
P.S. - the main idea and algorithm:
1. Client connects to the server and waits
2. Server checks for updates and pushes data to the client if anything changes
3. Client does some operations and then waits for more data
Without knowing how the snippet you provided links into your internet.XXXServer or reactor.listenXXX (or XXXXEndpoint) calls, it's hard to make heads or tails of it, but...
First off, in normal use, a Twisted protocol.Protocol's dataReceived would only be called by the framework itself. It would be linked to a client or server connection, directly or via a factory, and it would be called automatically as data comes into the given connection. (The vast majority of Twisted protocols and interfaces, if not all, are interrupt based, not polling/callLater; that's part of what makes Twisted so CPU efficient.)
So if your shown code is actually linked into Twisted via a Server or listen or Endpoint to your clients, then I think you will find very bad things will happen if your clients ever send data (... because Twisted will call dataReceived for that, which (among other problems) would add extra reactor.callLater callbacks, and all sorts of chaos would ensue...)
If instead the code isn't linked into Twisted's connection framework, then you're attempting to reuse Twisted classes in a space they aren't designed for (... I guess this seems unlikely, because I don't know how non-connection code would learn of a transport, unless you're setting it manually...)
The way I've been building models like this is to make a completely separate class for the polling-based I/O, but after I instantiate it, I push my client-list (server) factory into the polling instance (something like mypollingthing.servfact = myserverfactory), thereby making a way for my polling logic to call into the clients' .write (or, more likely, a def I built to abstract to the correct level for my polling logic).
I tend to take the examples in Krondo's Twisted Introduction as one of the canonical examples of how to do Twisted (other than the Twisted Matrix docs), and the example in part 6, under "Client 3.0", PoetryClientFactory has an __init__ that sets a callback in the factory.
If I try to blend that with the Twisted Matrix chat example and a few other things, I get:
(You'll want to change sendToAll to whatever your self.orders_queue.has_new_orders() is about.)
#!/usr/bin/python
from twisted.internet import task
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ServerFactory

class PollingIOThingy(object):
    def __init__(self):
        self.sendingcallback = None  # Note: I'm pushing sendToAll into here in main
        self.iotries = 0

    def pollingtry(self):
        self.iotries += 1
        print "Polling runs: " + str(self.iotries)
        if self.sendingcallback:
            self.sendingcallback("Polling runs: " + str(self.iotries) + "\n")

class MyClientConnections(Protocol):
    def connectionMade(self):
        print "Got new client!"
        self.factory.clients.append(self)

    def connectionLost(self, reason):
        print "Lost a client!"
        self.factory.clients.remove(self)

class MyServerFactory(ServerFactory):
    protocol = MyClientConnections

    def __init__(self):
        self.clients = []

    def sendToAll(self, message):
        for c in self.clients:
            c.transport.write(message)

def main():
    client_connection_factory = MyServerFactory()
    polling_stuff = PollingIOThingy()

    # the following line is what this example is all about:
    polling_stuff.sendingcallback = client_connection_factory.sendToAll
    # push the client connections' send def into my polling class

    # if you want to run something every second (instead of 1 second after
    # the end of your last code run, which could vary) do:
    l = task.LoopingCall(polling_stuff.pollingtry)
    l.start(1.0)
    # from: https://twistedmatrix.com/documents/12.3.0/core/howto/time.html

    reactor.listenTCP(5000, client_connection_factory)
    reactor.run()

if __name__ == '__main__':
    main()
To be fair, it might be better to inform PollingIOThingy of the callback by passing it as an arg to its __init__ (that is what is shown in Krondo's docs). For some reason I tend to miss connections like this when I read code and find class-cheating easier to see, but that may just be my personal brain damage.
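A minimal sketch of that variant, reusing MyServerFactory from the example above and only changing how the callback is wired up:

from twisted.internet import task, reactor

class PollingIOThingy(object):
    def __init__(self, sendingcallback=None):
        # the send callback is handed in at construction time instead of assigned afterwards
        self.sendingcallback = sendingcallback
        self.iotries = 0

    def pollingtry(self):
        self.iotries += 1
        if self.sendingcallback:
            self.sendingcallback("Polling runs: " + str(self.iotries) + "\n")

def main():
    client_connection_factory = MyServerFactory()  # factory class from the example above
    polling_stuff = PollingIOThingy(client_connection_factory.sendToAll)

    task.LoopingCall(polling_stuff.pollingtry).start(1.0)
    reactor.listenTCP(5000, client_connection_factory)
    reactor.run()

if __name__ == '__main__':
    main()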