Scrapy : Preserving a website - scrapy

I'm trying to save a copy of the pyparsing project on wikispaces.com before they take wikispaces down at the end of the month.
It seems odd (perhaps my version of google is broken ^_^), but I can't find any examples of duplicating/copying a site as is, that is, as one views it in a browser. SO has this and this on the topic, but they only save the text, strictly the HTML/DOM structure, of the site. Unless I'm mistaken, these answers do not appear to save the images/header link files/javascript and related information necessary to render the page. Further examples I have seen are more concerned with extracting parts of the page than with duplicating it as is.
I was wondering if anyone had experience with this sort of thing or could point me to a useful blog/doc somewhere. I've used WinHTTrack in the past, but the robots.txt or the pyparsing.wikispaces.com/auth/ route prevents it from running properly, and I figured I'd get some scrapy experience in.
For those interested in what I have tried thus far: here is my CrawlSpider implementation, which honours the robots.txt file.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # i = {}
        # #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # #i['name'] = response.xpath('//div[@id="name"]').extract()
        # #i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Trying the same thing with the sitemap spider is similar. The first SO link provides an implementation with a plain spider.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path

class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"
    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Neither of these spiders collects more than the HTML structure.
I have also found that the links that are saved do not point to proper relative paths. At least when opening the saved files directly, the links resolve relative to the hard drive rather than relative to the file, and when serving a page via http.server the links point to dead locations, presumably because of the added .html extension. It might be necessary to remap/replace links in the stored structure.
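One direction that might work (an untested sketch; the selectors and the save_asset name are only illustrative) is to have parse_item also request the assets each page references and write them out with the same netloc/path layout:
    def parse_item(self, response):
        # existing HTML-saving logic from above
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.with_suffix(".html").write_bytes(response.body)
        # illustrative selectors only; fonts, srcset and url() references inside CSS would need extra handling
        for selector in ('img::attr(src)', 'script::attr(src)', 'link[rel="stylesheet"]::attr(href)'):
            for url in response.css(selector).getall():
                # note: assets hosted outside allowed_domains will be filtered out by Scrapy
                yield scrapy.Request(response.urljoin(url), callback=self.save_asset)

    def save_asset(self, response):
        # keep the asset's own extension, same netloc/path layout as the pages
        page = urlparse(response.url)
        path = Path(page.netloc) / page.path.lstrip("/")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(response.body)
Even with that, off-site assets would need allowed_domains relaxed, and the links inside the saved HTML would still have to be rewritten to the local layout.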

Related

How to crawl whole website by following links using CrawlSpider?

I realized that using CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
For example, if http://mypage.test contains links to http://mypage.test/cats/ and http://mypage.test/horses/, the crawler would parse the cats and horses pages but not http://mypage.test itself. Here's a simple code sample:
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']
    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {
        'pipelines.MyPipeline': 100,
    },
})
process.crawl(MySpider)
process.start()
My goal is to parse every single page in a website by following links. How do I accomplish that?
Apparently, CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
Remove start_urls and add:
def start_requests(self):
    # requires: from scrapy import Request
    yield Request('http://mypage.test', callback=self.parse_page)
    # the same URL again so CrawlSpider's parse() can extract and follow links;
    # dont_filter=True keeps the duplicate request from being dropped by the dupefilter
    yield Request('http://mypage.test', callback=self.parse, dont_filter=True)
CrawlSpider uses self.parse to extract and follow links.
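Another option that keeps start_urls is to override CrawlSpider's parse_start_url hook, which is called for each start URL response and returns nothing by default:
def parse_start_url(self, response):
    # CrawlSpider calls this for responses to start_urls; reuse the item callback here
    return self.parse_page(response)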

Problem having same data while crawling a web page

I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data in the output for every page.
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
                          'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))
expected: crawling 1000 pages of the web site https://www.fandango.com/aquaman-208499/movie-reviews
actual output:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
The reviews are dynamically populated using JavaScript.
You have to inspect the requests made by the site in cases like this.
The URL to get user reviews is this:
https://www.fandango.com/napi/fanReviews/208499/1/5
It returns a json with 5 reviews.
Your spider could be rewritten like this:
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-208499/movie-reviews?pn={page}'.format(page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review
Note that I am also using yield instead of print to produce the items; this is how Scrapy expects items to be generated.
You can run this spider like this to export the extracted items to a file:
scrapy crawl rate -o outputfile.json

template_name in LogoutView not working on django2.1 python3.7

Below is the default setting in django.contrib.auth.views.LogoutView:
template_name = 'registration/logged_out.html'
I configured my app's urls.py like this:
from django.urls import path
from . import views
from django.conf import settings
from django.contrib.auth.views import LoginView, LogoutView

app_name = 'account'
urlpatterns = [
    # path("login/", views.user_login, name="user_login"),
    path("login/", LoginView.as_view(), name="user_login"),
    path("nlogin/", LoginView.as_view(), {"template_name": "account/login.html"}),
    path("logout/", LogoutView.as_view(), name="user_logout"),
    path("logoutt/", LogoutView.as_view(), {"template_name": "account/logout.html"}),
]
"template_name":"account/login.html" works properly, but "template_name":"account/logout.html" seems make no difference, what's wrong with my code?
When you use the class-based variant, you pass settings to the view through the .as_view(…) method (the so-called **initkwargs):
from django.urls import path
from . import views
from django.conf import settings
from django.contrib.auth.views import LoginView, LogoutView

app_name = 'account'
urlpatterns = [
    # path("login/", views.user_login, name="user_login"),
    path("login/", LoginView.as_view(), name="user_login"),
    path("nlogin/", LoginView.as_view(template_name='account/login.html')),
    path("logout/", LogoutView.as_view(), name="user_logout"),
    path("logoutt/", LogoutView.as_view(template_name='account/logout.html')),
]
Otherwise the parameters end up in self.kwargs, and the class-based view does not inspect these.
The documentation on LoginView [Django-doc] mentions this, as well as the list of parameters you can pass as **initkwargs.
Following Willem Van Onsem's advice, I found that the key problem is that I mixed up two styles of urlpatterns:
url() with a regular expression in urls.py (as I learned in the Django 1.10.1 tutorial):
from django.conf.urls import url
from django.contrib.auth import views

urlpatterns = [
    url(r"^login/$", views.login, {"template_name": "account/login.html"}, name='user_login'),
]
path() in urls.py (Django 2.1 docs):
from django.urls import path
from django.contrib.auth.views import LoginView

urlpatterns = [
    path("login/", LoginView.as_view(template_name="account/login.html"), name="user_login"),
]
There are two major differences to note:
url is imported from django.conf.urls, while path is imported directly from django.urls; path is new in Django 2.0 and seems simpler.
In Django 2.1, LoginView & LogoutView settings are passed to as_view(); compared to the older expression views.login, {"template_name": "account/login.html"}, this is simpler too.

How to implement django-fluent-contents to Mezzanine existing project

I have an existing Mezzanine project with existing pages. Is it possible to add the fluent-contents feature to the admin Page without the fluent-pages feature? I just want to keep Mezzanine page creation as it is, but with fluent-contents in it. Is this possible? Can anybody show an example of how to add it to the Mezzanine admin Page?
Since nobody had an answer for this, I figured it out myself and successfully implemented fluent-contents in an existing Mezzanine project.
It is fairly simple, but the investigation required digging through the core sources of the Mezzanine CMS. The resulting solution is a small app that extends Mezzanine Pages on both the admin and the client side.
(DIFFICULTY: medium/expert)
SOLUTION:
(for this example I used an app named "cms_coremodul")
PS: It was made with Python 3.4 in a virtual environment.
MEZZANINE SETUP AND INSTALLS:
-version of Mezzanine: 4.0.1
-install fluent-contents with the plugins you need (follow the fluent-contents docs):
pip install django-fluent-contents
-optionally, install the powerful CKEditor WYSIWYG editor:
pip install django-ckeditor
-after everything is installed, set up settings.py and run the migrations.
settings.py :
-fluent-contents has to be listed above your app and below the Mezzanine apps.
INSTALLED_APPS = (
    ...
    "fluent_contents",
    "django_wysiwyg",
    "ckeditor",

    # all working fluent-contents plugins
    'fluent_contents.plugins.text',   # requires django-wysiwyg
    'fluent_contents.plugins.code',   # requires pygments
    'fluent_contents.plugins.gist',
    'fluent_contents.plugins.iframe',
    'fluent_contents.plugins.markup',
    'fluent_contents.plugins.rawhtml',
    'fluent_contents.plugins.picture',
    'fluent_contents.plugins.oembeditem',
    'fluent_contents.plugins.sharedcontent',
    'fluent_contents.plugins.googledocsviewer',
    ...
    'here_will_be_your_app',
)
-settings for django-ckeditor:
settings.py :
# CORE MODUL DEFAULT WYSIWYG EDITOR SETUP
RICHTEXT_WIDGET_CLASS = "ckeditor.widgets.CKEditorWidget"
RICHTEXT_FILTER_LEVEL = 3
DJANGO_WYSIWYG_FLAVOR = "ckeditor"

# CKEditor config
CKEDITOR_CONFIGS = {
    'awesome_ckeditor': {
        'toolbar': 'Full',
    },
    'default': {
        'toolbar': 'Standard',
        'width': '100%',
    },
}
-after the fluent-contents setup in settings.py is complete, run the migrations:
python manage.py migrate
-if there is any error about a missing fluent-contents dependency, install that dependency and migrate again.
CREATE NEW APP FOR FLUENT-CONTENTS:
Create a new app in the Mezzanine project (same as in Django):
python manage.py startapp nameofyourapp
models.py :
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import models
from mezzanine.pages.models import Page
from django.utils.translation import ugettext_lazy as _
from fluent_contents.models import PlaceholderRelation, ContentItemRelation
from mezzanine.core.fields import FileField

from . import appconfig


class CoreModulPage(Page):
    template_name = models.CharField("Template choice", max_length=255, choices=appconfig.TEMPLATE_CHOICES, default=appconfig.COREMODUL_DEFAULT_TEMPLATE)

    # Accessing the data of django-fluent-contents
    placeholder_set = PlaceholderRelation()
    contentitem_set = ContentItemRelation()

    class Meta:
        verbose_name = _("Core page")
        verbose_name_plural = _("Core pages")
admin.py :
from django.contrib import admin
from django.http.response import HttpResponse
from mezzanine.pages.admin import PageAdmin
import json

# CORE MODUL IMPORT
from fluent_contents.admin import PlaceholderEditorAdmin
from fluent_contents.analyzer import get_template_placeholder_data
from django.template.loader import get_template
from .models import CoreModulPage
from . import appconfig
from fluent_contents.admin.placeholdereditor import PlaceholderEditorInline


class CoreModulAdmin(PlaceholderEditorAdmin, PageAdmin):
    #################################
    # CORE MODUL - PAGE LOGIC
    #################################
    corepage = CoreModulPage.objects.all()

    # CORE FLUENT-CONTENTS
    # This is where the magic happens.
    # Tell the base class which tabs to create
    def get_placeholder_data(self, request, obj):
        # Tell the base class which tabs to create
        template = self.get_page_template(obj)
        return get_template_placeholder_data(template)

    def get_page_template(self, obj):
        # Simple example that uses the template selected for the page.
        if not obj:
            return get_template(appconfig.COREMODUL_DEFAULT_TEMPLATE)
        else:
            return get_template(obj.template_name or appconfig.COREMODUL_DEFAULT_TEMPLATE)

    # Allow template layout changes in the client,
    # showing more power of the JavaScript engine.
    # THESE LINES ARE OPTIONAL
    # They set your own path to the admin templates and static files of fluent-contents
    #
    # START OPTIONAL LINES
    # this "PlaceholderEditorInline.template" is in the templates folder of your app
    PlaceholderEditorInline.template = "cms_plugins/cms_coremodul/admin/placeholder/inline_tabs.html"
    # this "PlaceholderEditorInline.Media.js"
    # and "PlaceholderEditorInline.Media.css" are in the static folder of your app
    PlaceholderEditorInline.Media.js = (
        'cms_plugins/cms_coremodul/admin/js/jquery.cookie.js',
        'cms_plugins/cms_coremodul/admin/js/cp_admin.js',
        'cms_plugins/cms_coremodul/admin/js/cp_data.js',
        'cms_plugins/cms_coremodul/admin/js/cp_tabs.js',
        'cms_plugins/cms_coremodul/admin/js/cp_plugins.js',
        'cms_plugins/cms_coremodul/admin/js/cp_widgets.js',
        'cms_plugins/cms_coremodul/admin/js/fluent_contents.js',
    )
    PlaceholderEditorInline.Media.css = {
        'screen': (
            'cms_plugins/cms_coremodul/admin/css/cp_admin.css',
        ),
    }
    PlaceholderEditorInline.extend = False  # No need for the standard 'admin/js/inlines.min.js' here.
    #
    # END OPTIONAL LINES

    # template that changes the rendering template for contents (combobox on the page to choose the desired template to render)
    change_form_template = "cms_plugins/cms_coremodul/admin/page/change_form.html"

    class Media:
        js = (
            'cms_plugins/cms_coremodul/admin/js/coremodul_layouts.js',
        )

    def get_layout_view(self, request):
        """
        Return the metadata about a layout
        """
        template_name = request.GET['name']

        # Check if the template is allowed, avoid parsing random templates
        templates = dict(appconfig.TEMPLATE_CHOICES)
        if template_name not in templates:  # dict.has_key() does not exist in Python 3
            jsondata = {'success': False, 'error': 'Template was not found!'}
            status = 404
        else:
            # Extract placeholders from the template, and pass them to the client.
            template = get_template(template_name)
            placeholders = get_template_placeholder_data(template)
            jsondata = {
                'placeholders': [p.as_dict() for p in placeholders],
            }
            status = 200

        jsonstr = json.dumps(jsondata)
        return HttpResponse(jsonstr, content_type='application/json', status=status)


admin.site.register(CoreModulPage, CoreModulAdmin)
appconfig.py :
-you have to create a new appconfig.py file in your app.
from django.conf import settings
from django.core.exceptions import ImproperlyConfigured

TEMPLATE_CHOICES = getattr(settings, "TEMPLATE_CHOICES", ())
COREMODUL_DEFAULT_TEMPLATE = getattr(settings, "COREMODUL_DEFAULT_TEMPLATE", TEMPLATE_CHOICES[0][0] if TEMPLATE_CHOICES else None)

if not TEMPLATE_CHOICES:
    raise ImproperlyConfigured("Value of variable 'TEMPLATE_CHOICES' is not set!")
if not COREMODUL_DEFAULT_TEMPLATE:
    raise ImproperlyConfigured("Value of variable 'COREMODUL_DEFAULT_TEMPLATE' is not set!")
settings.py :
-add these lines to the settings.py of your Mezzanine project:
# CORE MODUL TEMPLATE LIST
TEMPLATE_CHOICES = (
    ("pages/coremodulpage.html", "CoreModulPage"),
    ("pages/coremodulpagetwo.html", "CoreModulPage2"),
)

# CORE MODUL default template setup (if the drop-down does not exist in the admin interface)
COREMODUL_DEFAULT_TEMPLATE = TEMPLATE_CHOICES[0][0]
-append your app to INSTALLED_APPS:
INSTALLED_APPS = (
    ...
    "yourappname_with_fluentcontents",
)
create templates for the contents of your app:
-template with one placeholder:
coremodulpage.html:
{% extends "pages/page.html" %}
{% load mezzanine_tags fluent_contents_tags %}
{% block main %}{{ block.super }}
{% page_placeholder page.coremodulpage "main" role='m' %}
{% endblock %}
-template with two placeholders (one aside):
{% extends "pages/page.html" %}
{% load mezzanine_tags fluent_contents_tags %}
{% block main %}{{ block.super }}
{% page_placeholder page.coremodulpage "main" role='m' %}
<aside>
{% page_placeholder page.coremodulpage "sidepanel" role='s' %}
</aside>
{% endblock %}
-after your app is set up, make the migrations:
-1. Create the migrations for your app:
python manage.py makemigrations yourappname
-2. Apply the migrations to the database:
python manage.py migrate
FINALLY COMPLETE!
- try your new type of admin Page with the fluent-contents plugins installed.
- In the Page type dropdown in the admin, select Core Page. If you have created a render template for fluent-contents, a tab with the placeholders shows up, each with a dropdown in it. Now you can select the desired plugin and build the modular content of your page.

How to use middlewares when using scrapy runspider command?

I know that we can configure middlewares in settings.py when we have a scrapy project.
I haven't started a scrapy project, and I use the runspider command to run my spider, but I want to use some middlewares. How do I set them in the spider file?
When you run a spider using scrapy runspider my_file.py, you can use the -s option to pass spider settings, but only simple scalar values (like strings or integers). The problem is that the SPIDER_MIDDLEWARES setting expects a dictionary, and there isn't a really straightforward way to pass that through the command line.
Currently, the only way I know to set SPIDER_MIDDLEWARES for a spider without a project is through custom spider settings (the custom_settings attribute), which at the time of writing is only available in Scrapy from the code repo (not officially released yet) and is slated for Scrapy 1.0.
If you go that route, you can put your middlewares in a file middlewares.py and do:
import scrapy
import middlewares  # need this, or you get an import error

class MySpider(scrapy.Spider):
    name = 'my-spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'middlewares.SampleMiddleware': 500,
        }
    }
    ...
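For illustration only (the real middleware logic is whatever you need), middlewares.py could be as small as a spider middleware that logs what the spider yields:
# middlewares.py -- a minimal, illustrative spider middleware
import logging

logger = logging.getLogger(__name__)

class SampleMiddleware(object):
    def process_spider_output(self, response, result, spider):
        # pass every item/request through unchanged, logging it on the way
        for item_or_request in result:
            logger.debug("%s yielded %r for %s", spider.name, item_or_request, response.url)
            yield item_or_request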
Alternatively, if you're putting the middleware class in the same file, you can use:
import scrapy

class SampleMiddleware(object):
    # your middleware code here
    ...

def fullname(o):
    return o.__module__ + "." + o.__name__

class MySpider(scrapy.Spider):
    name = 'my-spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            fullname(SampleMiddleware): 500,
        }
    }
    ...
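Either way, the spider can still be run without a project as before:
scrapy runspider my_file.py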