Scrapy: How to send data to the pipeline from a custom filter without downloading

To catch all redirection paths, including when the final url was already crawled, I wrote a custom duplicate filter:
import logging

from scrapy.dupefilters import RFPDupeFilter
from seoscraper.items import RedirectionItem

class CustomURLFilter(RFPDupeFilter):

    def __init__(self, path=None, debug=False):
        super(CustomURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)
        if request_seen is True:
            item = RedirectionItem()
            item['sources'] = [u for u in request.meta.get('redirect_urls', u'')]
            item['destination'] = request.url
        return request_seen
Now, how can I send the RedirectionItem directly to the pipeline?
Is there a way to instantiate the pipeline from the custom filter so that I can send data to it directly? Or should I also create a custom scheduler and access the pipeline from there, and if so, how?

Related

Python Telegram Bot ConversationHandler not working with webhook

I want to make a ConversationHandler in a bot that uses a webhook, but the ConversationHandler only runs the entry-point function; after that it runs neither the state functions nor the fallback function. The same handler runs fine when the bot is run by polling.
ConversationHandler:
conv_handler = ConversationHandler(
    entry_points=[CommandHandler("start", start)],
    states={
        NEW_PROJECT: [CallbackQueryHandler(project_name)],
        PROJECT_NAME: [MessageHandler(Filters.regex(".*"), store_name_maybe_project_type)],
        PROJECT_TYPE: [CallbackQueryHandler(store_type_maybe_admin)]
    },
    fallbacks=[CommandHandler('cancel', cancel)],
)
All the required functions:
def start(update, context):
    # Gives button of add project
    # We can use for loop to display buttons
    keyboard = [
        [InlineKeyboardButton("Add Project", callback_data="add_project")],
    ]
    reply_markup = InlineKeyboardMarkup(keyboard)
    update.message.reply_text("You have no projects right now.", reply_markup=reply_markup)
    # if existing project then PROJECT or else NEW_PROJECT
    return NEW_PROJECT

def project_name(update, context):
    # asks for project name
    query = update.callback_query
    update.message.reply_text(text="Okay, Please enter your project name:")
    return PROJECT_NAME

def store_name_maybe_project_type(update, context):
    # stores project name and conditionally asks for project type
    print(update.message.text)
    keyboard = [
        [InlineKeyboardButton("Telegram Group", callback_data="group")],
        [InlineKeyboardButton("Telegram Channel", callback_data="channel")]
    ]
    reply_markup = InlineKeyboardMarkup(keyboard)
    update.message.reply_text("What do you want to make?", reply_markup=reply_markup)
    return PROJECT_TYPE

def store_type_maybe_admin(update, context):
    # stores project type and conditionally asks for making admin
    print(update.message.text)
    keyboard = [[InlineKeyboardButton("Done", callback_data="done")]]
    reply_markup = InlineKeyboardMarkup(keyboard)
    update.message.reply_text(f"Make a private {update.message.text} and make this bot the admin", reply_markup=reply_markup)
    return ConversationHandler.END

def cancel(update, context):
    update.message.reply_text("Awww, that's too bad")
    return ConversationHandler.END
This is how I set up the webhook (I think the problem is somewhere here):
@app.route(f"/{TOKEN}", methods=["POST"])
def respond():
    """Run the bot."""
    update = telegram.Update.de_json(request.get_json(force=True), bot)
    dispatcher = setup(bot, update)
    dispatcher.process_update(update)
    return "ok"
The setup function
def setup(bot, update):
    # Create bot, update queue and dispatcher instances
    dispatcher = Dispatcher(bot, None, workers=0)

    ##### Register handlers here #####
    bot_handlers = initialize_bot(update)
    for handler in bot_handlers:
        dispatcher.add_handler(handler)

    return dispatcher
And then I manually setup the webhook by using this route:
@app.route("/setwebhook", methods=["GET", "POST"])
def set_webhook():
    s = bot.setWebhook(f"{URL}{TOKEN}")
    if s:
        return "webhook setup ok"
    else:
        return "webhook setup failed"
The add project button doesn't do anything.
ConversationHandler stores the current state in memory, so the state is lost once the conv_handler reaches the end of its lifetime (i.e. the variable is deleted or the process is shut down). Your snippets don't show where you initialize the ConversationHandler, but I have the feeling that you create it anew for every incoming update, and every new instance has no knowledge of the previous one.
I have that feeling because you create a new Dispatcher for every update as well. That's not necessary, and in fact I'd strongly advise against it. Not only does it take time to initialize the Dispatcher, which you could save, but also, if you're using chat/user/bot_data, the data gets lost every time you create a new instance.
The initialize_bot function is called in setup, where you create the new Dispatcher, which is why my guess is that you create a new ConversationHandler for every update. It also seems odd that the return value of that function depends on the update; the handlers used by your dispatcher should be fixed.
Disclaimer: I'm currently the maintainer of python-telegram-bot

Flask-Appbuilder - User security role required in View

If the current user role = admin then show all the records in the table. If not, then limit the rows by created user.
I can get the user name if I define a function in the View Class, but need it before the list is constructed. See source code below.
from flask_appbuilder import ModelView
from flask_appbuilder.models.sqla.interface import SQLAInterface
from flask_appbuilder.models.sqla.filters import FilterEqualFunction
from app import appbuilder, db
from app.models import Language
from wtforms import validators, TextField
from flask import g
from flask_appbuilder.security.sqla.models import User

def get_user():
    return g.user

class LanguageView(ModelView):
    datamodel = SQLAInterface(Language)
    list_columns = ["id", "name"]
    base_order = ("name", "asc")
    page_size = 50

    # This is the part that does not work - Error: Working outside of application context
    # If the user role is admin show all, if not filter only to the specific user
    if g.user.roles != "admin":
        base_filters = [['created_by', FilterEqualFunction, get_user]]
This is the error I'm getting:
Was unable to import app Error: Working outside of application context.
This typically means that you attempted to use functionality that needed
to interface with the current application object in some way. To solve
this, set up an application context with app.app_context(). See the
documentation for more information.
In this case it is better to create two different ModelViews: one for users, with base_filters = [['created_by', FilterEqualFunction, get_user]], and a second one for admins only, without any filtering.
And do not forget to specify the correct permissions for both.
In my case, I created a new filter, FilterStartsWithFunction, by copying FilterStartsWith (you can find the source code in the Flask-Appbuilder package easily). See the code:
from flask import g
from flask_appbuilder.models.filters import BaseFilter
from flask_appbuilder.models.sqla.filters import get_field_setup_query

class FilterStartsWithFunction(BaseFilter):
    name = "Filter view with a function"
    arg_name = "eqf"

    def apply(self, query, func):
        query, field = get_field_setup_query(query, self.model, self.column_name)
        return query.filter(field.ilike(func() + "%"))

def get_user():
    if 'Admin' in [r.name for r in g.user.roles]:
        return ''
    else:
        return g.user.username

...
...
base_filters = [['created_by', FilterStartsWithFunction, get_user]]

Django unit testing: AttributeError: 'WSGIRequest' object has no attribute 'user'

When running the tests, I get the following error:
user = self.request.user
AttributeError: 'WSGIRequest' object has no attribute 'user'
I have tried switching from MIDDLEWARE to MIDDLEWARE_CLASSES and vice versa. Currently, I'm running Django 2.1.
Also, I have tried switching middleware positions and it didn't help.
settings.py
MIDDLEWARE = (
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.csrf.CsrfViewMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.contrib.auth.middleware.SessionAuthenticationMiddleware",
    "django.contrib.messages.middleware.MessageMiddleware",
    "django.middleware.clickjacking.XFrameOptionsMiddleware",
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",  # to serve static files
)
test_views.py
from django.test import TestCase, RequestFactory
from mixer.backend.django import mixer

from .. import views
from accounts.models import User

class TestHomeView(TestCase):

    def test_anonymous(self):
        req = RequestFactory().get("/")
        resp = views.HomeView.as_view()(req)
        self.assertEquals(resp.status_code, 200,
                          "Should be callable by anyone")

    def test_auth_user(self):
        req = RequestFactory().get("/")
        req.user = mixer.blend("accounts.User")
        resp = views.HomeView.as_view()(req)
        self.assertTrue("dashboard" in resp.url,
                        "Should redirect authenticated user to /dashboard/")
error output
user = self.request.user
AttributeError: 'WSGIRequest' object has no attribute 'user'
In the first test, test_anonymous, you are not actually setting the user. As per the documentation (https://docs.djangoproject.com/en/2.1/topics/testing/advanced/#the-request-factory), middleware is not executed when you use RequestFactory, so you have to set the user explicitly. If you want to test against an anonymous user, set request.user = AnonymousUser(). If you don't, the request object will not have the user attribute, hence the error.

Custom scrapy xml+rss exporter to S3

I'm trying to create a custom xml feed, that will contain the spider scraped items, as well as some other high level information, stored in the spider definition. The output should be stored on S3.
The desired output looks like the following:
<xml>
    <title>my title defined in the spider</title>
    <description>The description from the spider</description>
    <items>
        <item>...</item>
    </items>
</xml>
In order to do so, I defined a custom exporter, which is able to export the desired output file locally.
spider.py:
class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/']
    title = 'The DMOZ super feed'

    def parse(self, response):
        ...
        yield item
exporters.py:
from scrapy.conf import settings
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):

    def __init__(self, *args, **kwargs):
        self.title = kwargs.pop('title', 'no title found')
        self.link = settings.get('FEED_URI', 'localhost')
        super(CustomItemExporter, self).__init__(*args, **kwargs)

    def start_exporting(self):
        ...
        self._export_xml_field('title', self.title)
        ...
settings.py:
FEED_URI = 's3://bucket-name/%(name)s.xml'
FEED_EXPORTERS = {
    'custom': 'my.exporters.CustomItemExporter',
}
I'm able to run the whole thing and get the output on s3 by running the following command:
scrapy crawl dmoz -t custom
or, if I want to export a json locally instead: scrapy crawl -o dmoz.json dmoz
But at this point, I'm unable to retrieve the spider title to put it in the output file.
I tried implementing a custom pipeline, which outputs data locally (following numerous examples):
pipelines.py:
from scrapy import signals

from my.exporters import CustomItemExporter

class CustomExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_feed.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CustomItemExporter(
            file,
            title=spider.title,
        )
        self.exporter.start_exporting()
The problem is that the file is stored locally, which short-circuits the FeedExporter logic defined in feedexport.py that handles all the different storages.
No info from the FeedExporter is available in the pipeline, and I would like to reuse all that logic without duplicating code. Am I missing something? Thanks for any help.
Here's my solution: get rid of the pipeline and override scrapy's FeedExporter instead.
myproject/feedexport.py:
from scrapy.extensions.feedexport import FeedExporter as _FeedExporter
from scrapy.extensions.feedexport import SpiderSlot

class FeedExporter(_FeedExporter):

    def open_spider(self, spider):
        uri = self.urifmt % self._get_uri_params(spider)
        storage = self._get_storage(uri)
        file = storage.open(spider)
        extra = {
            # my extra settings
        }
        exporter = self._get_exporter(file, fields_to_export=self.export_fields, extra=extra)
        exporter.start_exporting()
        self.slot = SpiderSlot(file, exporter, storage, uri)
All I wanted was basically to pass those extra settings to the exporter, but the way it's built, there is no choice but to override.
To support other scrapy export formats simultaneously, I would have to consider overriding the dont_fail setting to True in some scrapy exporters to prevent them from failing.
Then replace scrapy's feed exporter with the new one in the settings:
myproject/settings.py:
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,
    'myproject.feedexport.FeedExporter': 0,
}
... otherwise the two feed exporters would run at the same time.
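On the exporter side, the custom exporter's __init__ has to accept that extra dict. A minimal framework-free sketch of the kwarg plumbing, where BaseExporter is just a stand-in for scrapy's XmlItemExporter (which likewise takes the file object first):

```python
class BaseExporter:
    # Stand-in for scrapy's XmlItemExporter: takes a file object plus
    # arbitrary keyword options.
    def __init__(self, file, **kwargs):
        self.file = file

class CustomItemExporter(BaseExporter):
    def __init__(self, file, extra=None, **kwargs):
        # Pull our high-level fields out of the extra dict before handing
        # the remaining kwargs on to the base exporter.
        extra = extra or {}
        self.title = extra.get('title', 'no title found')
        super().__init__(file, **kwargs)

# This mirrors what the overridden FeedExporter does when it calls
# self._get_exporter(file, ..., extra=extra):
exporter = CustomItemExporter(file=None, extra={'title': 'The DMOZ super feed'})
```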

Django Rest Framework custom permissions per view

I want to create permissions in Django Rest Framework, based on view + method + user permissions.
Is there a way to achieve this without manually writing each permission, and checking the permissions of the group that the user is in?
Also, another problem I am facing is that permission objects are tied up to a certain model. Since I have views that affect different models, or I want to grant different permissions on the method PUT, depending on what view I accessed (because it affects different fields), I want my permissions to be tied to a certain view, and not to a certain model.
Does anyone know how this can be done?
I am looking for a solution in the sort of:
Create a Permissions object with the following parameters: View_affected, list_of_allowed_methods(GET,POST,etc.)
Create a Group object that has that permission associated
Add a user to the group
Have my default permission class take care of everything.
From what I have now, the step that is giving me problems is step 1. Because I see no way of tying a Permission with a View, and because Permissions ask for a model, and I do not want a model.
Well, the first step can be done easily with DRF. See http://www.django-rest-framework.org/api-guide/permissions#custom-permissions.
You would do something like this:
from functools import partial

from rest_framework import permissions

class MyPermission(permissions.BasePermission):

    def __init__(self, allowed_methods):
        super().__init__()
        self.allowed_methods = allowed_methods

    def has_permission(self, request, view):
        return request.method in self.allowed_methods

class ExampleView(APIView):
    permission_classes = (partial(MyPermission, ['GET', 'HEAD']),)
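The reason the partial works: DRF instantiates every entry in permission_classes by calling it with no arguments, so a partial with the constructor arguments pre-bound behaves like a zero-argument permission class. A framework-free sketch of that mechanism (the class here is a stand-in, not DRF's BasePermission, and its has_permission signature is simplified):

```python
from functools import partial

class MyPermission:
    # Stand-in for a DRF BasePermission subclass that takes a
    # constructor argument.
    def __init__(self, allowed_methods):
        self.allowed_methods = allowed_methods

    def has_permission(self, method):
        return method in self.allowed_methods

# DRF does roughly this: call each entry in permission_classes with no args.
permission_classes = (partial(MyPermission, ['GET', 'HEAD']),)
instances = [cls() for cls in permission_classes]

allowed = instances[0].has_permission('GET')
denied = instances[0].has_permission('POST')
```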
A custom permission can be created this way; more info in the official documentation (https://www.django-rest-framework.org/api-guide/permissions/):
from rest_framework.permissions import BasePermission

# Custom permission for users with "is_active" = True.
class IsActive(BasePermission):
    """
    Allows access only to "is_active" users.
    """

    def has_permission(self, request, view):
        return request.user and request.user.is_active

# Usage
from rest_framework.views import APIView
from rest_framework.response import Response

from .permissions import IsActive  # Path to our custom permission

class ExampleView(APIView):
    permission_classes = (IsActive,)

    def get(self, request, format=None):
        content = {
            'status': 'request was permitted'
        }
        return Response(content)
I took this idea and got it to work like so:
class genericPermissionCheck(permissions.BasePermission):

    def __init__(self, action, entity):
        self.action = action
        self.entity = entity

    def has_permission(self, request, view):
        print(self.action)
        print(self.entity)
        if request.user and request.user.role.access_rights.filter(action=self.action, entity=self.entity):
            print('permission granted')
            return True
        else:
            return False
I used partial in the decorator for the Categories action in my viewset class, like so:
@list_route(methods=['get'], permission_classes=[partial(genericPermissionCheck, 'Fetch', 'Categories')])
def Categories(self, request):
    ...
"access_rights" maps to an array of objects, each pairing an action and an entity, e.g. 'Edit' and 'Blog'.