How to access `request_seen()` inside Spider? - scrapy

I have a Spider, and I want to check whether a request I am about to schedule has already been seen by request_seen() or not.
I don't want to do the check inside a downloader/spider middleware; I just want to check inside my Spider.
Is there any way to call that method?

You should be able to access the dupe filter itself from the spider like this:

self.dupefilter = self.crawler.engine.slot.scheduler.df

Then you can use that in other places to check:

req = scrapy.Request('whatever')
if self.dupefilter.request_seen(req):
    # it's already been seen
    pass
else:
    # never saw this one coming
    pass
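For context, here is a minimal sketch of how this could be wired into a spider callback. Note that crawler.engine.slot.scheduler.df is internal API and may differ between Scrapy versions, and that the default RFPDupeFilter records a request as seen when request_seen() is called, so treat this as an illustration rather than a guaranteed interface:

import scrapy

class CheckSeenSpider(scrapy.Spider):
    name = 'check_seen'
    start_urls = ['https://example.com']

    def parse(self, response):
        # The engine (and therefore the scheduler's dupe filter) only exists
        # once the crawl is running, so look it up lazily here.
        dupefilter = self.crawler.engine.slot.scheduler.df
        req = scrapy.Request('https://example.com/next-page')
        if dupefilter.request_seen(req):
            self.logger.info('Already seen, skipping: %s', req.url)
        else:
            yield req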

I did something similar to yours with a pipeline. The following is the code that I use.
You should pick an identifier for each item and then use it to check whether the item has been seen before:

from scrapy.exceptions import DropItem

class SeenPipeline(object):
    def __init__(self):
        self.isbns_seen = set()

    def process_item(self, item, spider):
        if item['isbn'] in self.isbns_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.isbns_seen.add(item['isbn'])
        return item

Note: You can use this code within your spider, too.
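If you'd rather keep the check inside the spider itself, as the question asks, the same set-based idea carries over directly. A minimal sketch, with made-up selectors and an 'isbn' field as above:

import scrapy

class SeenInSpider(scrapy.Spider):
    name = 'seen_in_spider'
    start_urls = ['https://example.com/books']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.isbns_seen = set()  # per-instance, so nothing is shared between spiders

    def parse(self, response):
        for book in response.css('.book'):
            isbn = book.css('::attr(data-isbn)').get()
            if isbn in self.isbns_seen:
                continue  # already yielded this one
            self.isbns_seen.add(isbn)
            yield {'isbn': isbn}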

Related

Scrapy concurrent spiders instance variables

I have a number of Scrapy spiders running and recently hit a strange bug. I have a base class and a number of subclasses:

class MyBaseSpider(scrapy.Spider):
    new_items = []

    def spider_closed(self):
        # Email any new items that weren't in the last run
        ...

class MySpiderImpl1(MyBaseSpider):
    def parse(self, response):
        # Implement site specific checks
        self.new_items.append(new_found_item)

class MySpiderImpl2(MyBaseSpider):
    def parse(self, response):
        # Implement site specific checks
        self.new_items.append(new_found_item)

This seems to have been running well; new items get emailed to me on a per-site basis. However, I've recently had some emails from MySpiderImpl1 which contain items from Site 2.
I'm following the documentation to run from a script:
scraper_settings = get_project_settings()
runner = CrawlerRunner(scraper_settings)
configure_logging()
sites = get_spider_names()
for site in sites:
    runner.crawl(site.spider_name)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
I suspect the solution here is to switch to a pipeline which collates the items for a site and emails them out when pipeline.close_spider is called but I was surprised to see the new_items variable leaking between spiders.
Is there any documentation on concurrent runs? Is it bad practice to keep variables on a base class? I do also track other pieces of information on the spiders in variables such as the run number - should this be tracked elsewhere?
In Python, all class variables are shared between all instances and all subclasses. So your MyBaseSpider.new_items is the exact same list object that MySpiderImpl1.new_items and MySpiderImpl2.new_items refer to.
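A quick illustration of that sharing, independent of Scrapy:

class Base:
    items = []            # class attribute: one list shared by every subclass

class A(Base):
    pass

class B(Base):
    pass

A.items.append('from A')
print(B.items)                            # ['from A'] -- B sees A's data
print(A.items is B.items is Base.items)   # True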
As you suggested, you could implement a pipeline, although this might require significantly refactoring your current code. It could look something like this.
pipelines.py

class MyPipeline:
    def process_item(self, item, spider):
        if spider.name == 'site1':
            ...  # email item
        elif spider.name == 'site2':
            ...  # do something different

I am assuming all of your spiders have names... I think it's a requirement.
Another option that probably requires less effort might be to override the start_requests method in your base class to assign a unique list at the start of the crawling process, for example:
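A minimal sketch of that approach, keeping the rest of the base class as it is:

import scrapy

class MyBaseSpider(scrapy.Spider):
    new_items = []  # still defined here, but shadowed per instance below

    def start_requests(self):
        # Give each spider instance its own list before crawling begins,
        # so concurrently running spiders no longer share state.
        self.new_items = []
        yield from super().start_requests()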

JobQueue.run_repeating to run a function without command handler in Telegram

I need to start sending notifications to a TG group. Before that, I want to run a function continuously that queries an API and stores data in a DB. While this function is running, I want to be able to send notifications if they are available in the DB.
This is my code:
import telegram
from telegram.ext import Updater, CommandHandler, JobQueue

token = "tokenno:token"
bot = telegram.Bot(token=token)

def start(update, context):
    context.bot.send_message(chat_id=update.message.chat_id,
                             text="You will now receive msgs!")

def callback_minute(context):
    chat_id = context.job.context
    # Check in DB and send if new msgs exist
    send_msgs_tg(context, chat_id)

def callback_msgs():
    fetch_msgs()

def main():
    JobQueue.run_repeating(callback_msgs, interval=5, first=1, context=None)
    updater = Updater(token, use_context=True)
    dp = updater.dispatcher
    dp.add_handler(CommandHandler("start", start, pass_job_queue=True))
    updater.start_polling()
    updater.idle()

if __name__ == '__main__':
    main()
This code gives me this error:
TypeError: run_repeating() missing 1 required positional argument: 'callback'
Any help would be greatly appreciated.
There are a few issues with your code; let me try to point them out:
1.
def callback_msgs(): fetch_msgs()
You use callback_msgs as the callback for your job. But job callbacks take exactly one argument of type telegram.ext.CallbackContext.
2.
JobQueue.run_repeating(callback_msgs, interval=5, first=1, context=None)
JobQueue is a class. To use run_repeating, which is an instance method, you'll need an instance of that class. In fact the Updater already builds an instance for you; it's available as updater.job_queue in your case. So the call should look like this:
updater.job_queue.run_repeating(callback_msgs, interval=5, first=1, context=None)
3.
CommandHandler("start", start, pass_job_queue=True)
This is not strictly speaking an issue, but pass_job_queue=True has no effect at all, because you use use_context=True.
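Putting those fixes together, the relevant parts might look like this. This is a sketch against the v13-style python-telegram-bot API used in the question (use_context=True), and fetch_msgs is assumed to be defined elsewhere, as in the original code:

from telegram.ext import Updater, CommandHandler, CallbackContext

token = "tokenno:token"

def start(update, context):
    context.bot.send_message(chat_id=update.message.chat_id,
                             text="You will now receive msgs!")

def callback_msgs(context: CallbackContext):
    # Job callbacks receive exactly one argument: the CallbackContext.
    fetch_msgs()

def main():
    updater = Updater(token, use_context=True)
    updater.dispatcher.add_handler(CommandHandler("start", start))
    # Use the JobQueue instance the Updater already created.
    updater.job_queue.run_repeating(callback_msgs, interval=5, first=1)
    updater.start_polling()
    updater.idle()

if __name__ == '__main__':
    main()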
Please note that there is a nice tutorial on JobQueue over at the ptb-wiki. There is also an example on how to use it.
Disclaimer: I'm currently the maintainer of python-telegram-bot

Adding custom headers to all boto3 requests

I need to add some custom headers to every boto3 request that is sent out. Is there a way to manage the connection itself to add these headers?
For boto2, connection.AWSAuthConnection has a method build_base_http_request which has been helpful. I've yet to find an analogous function within the boto3 documentation though.
This is pretty dated but we encountered the same issue, so I'm posting our solution.
I wanted to add custom headers to boto3 for specific requests.
I found this: https://github.com/boto/boto3/issues/2251, and used the event system for adding the header
def _add_header(request, **kwargs):
    request.headers.add_header('x-trace-id', 'trace-trace')
    print(request.headers)  # for debug

some_client = boto3.client(service_name=SERVICE_NAME)
event_system = some_client.meta.events
event_system.register_first('before-sign.EVENT_NAME.*', _add_header)

You can try using a wildcard for all requests:

event_system.register_first('before-sign.*.*', _add_header)
*SERVICE_NAME: you can find all available services here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/index.html
For more information about registering a function to a specific event, see: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
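For a self-contained illustration of the same pattern against a concrete client (the 's3' service, the header name, and the final call are just examples, and AWS credentials are assumed to be configured):

import boto3

def _add_header(request, **kwargs):
    # Runs just before the request is signed and sent.
    request.headers.add_header('x-trace-id', 'trace-trace')

s3 = boto3.client('s3')
# 'before-sign.s3.*' fires for every S3 operation on this client;
# 'before-sign.*.*' would cover every service.
s3.meta.events.register_first('before-sign.s3.*', _add_header)

s3.list_buckets()  # this request now carries the x-trace-id header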
The answer from @May Yaari is pretty awesome. As for the concern raised by @arainchi:

"This works, there is no way to pass custom data to event handlers, currently we have to do it in a non-pythonic way using global variables/queues :( I have opened issue ticket with Boto3 developers for this exact case"

Actually, we can leverage a functional-programming property of Python, returning a function from inside a function, to get around this.
In the case where we want to add a custom value custom_variable to the header, we could do:

def _register_callback(custom_variable):
    def _add_header(request, **kwargs):
        request.headers.add_header('header_name_you_want', custom_variable)
    return _add_header

some_client = boto3.client(service_name=SERVICE_NAME)
event_system = some_client.meta.events
event_system.register_first('before-sign.EVENT_NAME.*', _register_callback(custom_variable))
Or, in a more Pythonic way, using a lambda:

def _add_header(request, custom_variable):
    request.headers.add_header('header_name_you_want', custom_variable)

some_client = boto3.client(service_name=SERVICE_NAME)
event_system = some_client.meta.events
event_system.register_first('before-sign.EVENT_NAME.*',
                            lambda request, **kwargs: _add_header(request, custom_variable))

Override Scrapy output format 'on the fly'

I want to override spider's output format right in the code. I can't modify the settings; I can't change the command line. I want to do it right in the __init__ method.
Ideally, that new output format should work even if something like -o /tmp/1.csv is passed to the spider. But if that's not possible, I can let that part go.
How can I do that?
Thank you.
So, you can put a custom attribute in your spider that sets up how the data should be handled for this spider and create a Scrapy item pipeline that honors that configuration.
Your spider code would look like:
from scrapy import Spider

class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.data_destination = self._get_data_destination()

    def _get_data_destination(self):
        # return your dynamically discovered data destination settings here
        ...
And your item pipeline would be something like:
class MySuperDuperPipeline(object):
    def process_item(self, item, spider):
        data_destination = getattr(spider, 'data_destination')
        # code to handle item conforming to data destination here...
        return item
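For completeness, the pipeline still has to be enabled somewhere. If you can't touch the project settings or the command line, one option is the spider's custom_settings attribute; the dotted path below is illustrative and should point at your actual pipelines module:

class MySpider(Spider):
    name = 'myspider'
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.MySuperDuperPipeline': 300,
        },
    }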

Handling traversal in Zope2 product

I want to create a simple Zope2 product that implements a "virtual" folder where a part of the path is processed by my code. A URI of the form
/members/$id/view
e.g.
/members/isaacnewton/view
should be handled by code in the /members object, i.e. a method like members.view(id='isaacnewton').
The Zope TTW Python scripts have traverse_subpath but I have no idea how to do this in my product code.
I have looked at the IPublishTraverse interface and its publishTraverse() method, but it seems very generic.
Is there an easier way?
Still, the easiest way is to use the __before_publishing_traverse__ hook on the members object:

from zExceptions import Redirect

def __before_publishing_traverse__(self, object, request):
    stack = request.TraversalRequestNameStack
    if len(stack) > 1 and stack[-2] == 'view':
        try:
            request.form['member_id'] = stack.pop(-1)
            if not validate(request['member_id']):  # validate() is your own check
                raise ValueError
        except (IndexError, ValueError):
            # missing or invalid id; perhaps some URL hacking going on?
            raise Redirect(self.absolute_url())  # redirects to `/members`, adjust as needed
This method is called by the publisher before traversing further, so the publisher has already located the members object, and the method is passed that object (object) and the request. On the request you'll find the traversal stack; in your example case it'll hold ['view', 'isaacnewton'], and this method moves 'isaacnewton' into the request under the key 'member_id' (after an optional validation).
When this method returns, the publisher uses the remaining stack to continue the traversal, so it'll now traverse to view, which should be a browser view that expects a member_id key in the request. It can then do its work:
from Products.Five.browser import BrowserView

class MemberView(BrowserView):
    def __call__(self):
        if 'member_id' in self.request.form:  # Huzzah, the traversal worked!
            member_id = self.request.form['member_id']
            # ... look up the member by member_id and render it here