How to use middlewares when using the scrapy runspider command?

I know that we can configure middlewares in settings.py when we have a scrapy project.
I haven't started a scrapy project; I use the runspider command to run my spider, but I want to use some middlewares. How can I set them in the spider file?

So, the problem is that when you run a spider using scrapy runspider my_file.py, the -s option only lets you pass simple scalar settings (like strings or integers), while the SPIDER_MIDDLEWARES setting expects a dictionary, and there isn't a really straightforward way to pass a dictionary through the command line.
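For instance, a scalar setting passes through fine like this (DOWNLOAD_DELAY is just an example setting):
scrapy runspider my_file.py -s DOWNLOAD_DELAY=2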
Currently, the only way I know to set SPIDER_MIDDLEWARES for a spider without a project is the custom_settings spider attribute, added in Scrapy 1.0 (at the time of writing, available from the code repo but not officially released yet).
If you go that route, you can put your middlewares in a file middlewares.py and do:
import scrapy
import middlewares  # need this, or you get an import error

class MySpider(scrapy.Spider):
    name = 'my-spider'

    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            # dotted path to the middleware class, with its priority
            'middlewares.SampleMiddleware': 500,
        }
    }
    ...
Alternatively, if you're putting the middleware class in the same file, you can use:
import scrapy

class SampleMiddleware(object):
    # your middleware code here
    ...

def fullname(o):
    # builds the dotted path string Scrapy expects, e.g. '__main__.SampleMiddleware'
    return o.__module__ + "." + o.__name__

class MySpider(scrapy.Spider):
    name = 'my-spider'

    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            fullname(SampleMiddleware): 500,
        }
    }
    ...
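With either approach, the spider can then be run without a project, reusing the placeholder file name from above:
scrapy runspider my_file.py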

Related

Swagger-UI and Ktor: how to import a swagger.json or .yaml file and start Swagger-UI?

I have already generated OpenAPI Swagger files (swagger.json and swagger.yaml). Is it possible in Ktor (Kotlin) to import these files and serve Swagger-UI on a specific route?
I have looked at the ktor-swagger project but could not figure out how to simply add a JSON file to display in Swagger-UI.
Any suggestions?
Make your JSON file accessible via static routing:
Create a "files" directory in the resources dir.
Use the following code to serve static files from the "files" dir:
routing {
    static("/static") {
        resources("files")
    }
    //...
}
Let's name the JSON file "test.json" and put it in the "files" folder.
Now you can run the app and see the file at http://localhost:8080/static/test.json
Then we need to install and configure OpenAPIGen.
Add the OpenAPIGen dependencies; you can find how to do this here: https://github.com/papsign/Ktor-OpenAPI-Generator
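As a rough sketch of the Gradle (Kotlin DSL) setup — the JitPack coordinates and the version tag below are assumptions, so check the project README for the current ones:
repositories {
    maven { url = uri("https://jitpack.io") }  // the generator is assumed to be published via JitPack
}
dependencies {
    // <version> is a placeholder; use the latest tag from the repository
    implementation("com.github.papsign:Ktor-OpenAPI-Generator:<version>")
}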
Then use the following code in your Ktor application module:
install(OpenAPIGen) {
    serveSwaggerUi = true
    swaggerUiPath = "/swagger-ui"  // the Swagger UI will be served from this path
}
Now you can use the URL http://localhost:8080/swagger-ui/index.html?url=/static/test.json to see your file through the Swagger UI.
Also, you can provide your own route using a redirect:
get("/api") {
call.respondRedirect("/swagger-ui/index.html?url=/static/test.json", true)
}
this will allow you to use just http://localhost:8080/api

Vue CLI 3 Nightwatch page object configuration

I'm using Vue CLI 3, version 3.0.5.
In the project configuration, I use Nightwatch as the e2e test tool.
I want to use page objects, so I added a nightwatch.config.js file in the project root, with page_objects_path set like below:
{
    page_objects_path : "/tests/e2e/page-objects"
}
Then I created the page-objects folder at that path: /tests/e2e/page-objects.
Then I set up a page object Entry.js under that folder and tried to use it in a test:
/tests/e2e/page-objects/Entry.js
module.exports = {
    'Test Page Object': browser => {
        browser
            .url(process.env.VUE_DEV_SERVER_URL)
            .waitForElementVisible('#app', 5000)
        browser.page.Entry().sayHello()
        browser.end()
    }
}
And the error message shows:
Cannot read property 'Entry' of undefined
It looks like my page object setup is not correct...
Could anyone provide a correct implementation of a Nightwatch page object in Vue CLI v3.0.5? Thanks...
Ah, I know why it won't work.
Because nightwatch.config.js is a JavaScript file, the configuration object has to be exported first; then the plugin can read it:
module.exports = {
    page_objects_path: "/tests/e2e/page-objects"
}
Sorry for the dumb question.
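For completeness, a minimal Entry.js page object that would satisfy the test above might look like this; the sayHello command is hypothetical, mirroring the call in the question's test code:
// tests/e2e/page-objects/Entry.js
module.exports = {
    commands: [{
        // custom command invoked by the test as browser.page.Entry().sayHello()
        sayHello() {
            console.log('hello from the Entry page object')
            return this
        }
    }],
    elements: {
        app: { selector: '#app' }
    }
}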

Scrapy : Preserving a website

I'm trying to save a copy of the pyparsing project on wikispaces.com before they take wikispaces down at the end of the month.
It seems odd (perhaps my version of google is broken ^_^) but I can't find any examples of duplicating/copying a site as is, that is, as one views it in a browser. SO has this and this on the topic, but they just save the text, strictly the HTML/DOM structure, of the site. Unless I'm mistaken, these answers do not appear to save the images/header link files/javascript and related information necessary to render the page. Further examples I have seen are more concerned with extracting parts of the page than with duplicating it as is.
I was wondering if anyone had any experience with this sort of thing or could point me to a useful blog/doc somewhere. I've used WinHTTrack in the past, but the robots.txt or the pyparsing.wikispaces.com/auth/ route prevent it from running properly, and I figured I'd get some scrapy experience in.
For those interested in what I have tried thus far: here is my crawl spider implementation, which respects the robots.txt file.
import scrapy
from urllib.parse import urlparse
from pathlib import Path
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # creates the folder hierarchy
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Trying the same thing with the sitemap spider is similar. The first SO link provides an implementation with a plain spider.
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path

class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"

    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # creates the folder hierarchy
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
None of these spiders collect more than the HTML structure.
Also, I have found that the links, ..., that are saved do not appear to point to proper relative paths. At least, when opening the saved files, the links point to a path relative to the hard drive and not relative to the file. When opening a page via http.server the links point to dead locations; presumably the .html extension is the trouble here. It might be necessary to remap/replace links in the stored structure, as sketched below.
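A rough post-processing sketch (my own addition, not from the question) that rewrites absolute in-domain hrefs in the saved pages to local *.html paths; it assumes the mirror layout produced by the spiders above and ignores relative-path nuances between nested pages:
from pathlib import Path
from urllib.parse import urlparse
import re

DOMAIN = "pyparsing.wikispaces.com"
LINK = re.compile(r'(href=)"(https?://%s[^"]*)"' % re.escape(DOMAIN))

def remap_links(html: str) -> str:
    # href="http://pyparsing.wikispaces.com/Foo" -> href="Foo.html"
    def repl(match):
        local = urlparse(match.group(2)).path.lstrip("/") or "index"
        return '%s"%s.html"' % (match.group(1), local)
    return LINK.sub(repl, html)

for page in Path(DOMAIN).rglob("*.html"):
    text = page.read_text(encoding="utf-8", errors="ignore")
    page.write_text(remap_links(text), encoding="utf-8")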

How to bundle a plugin which requires multiple files in order to work

I'm facing an issue when I try to bundle Aurelia-hammer with the CLI.
The app still keeps pulling hammer-swipe.js, hammer-tap.js,... from the node_modules folder.
When I inspect the plugin's AMD structure, these are defined as global resources:
function configure(frameworkConfig) {
    frameworkConfig.globalResources('./hammer-swipe');
    frameworkConfig.globalResources('./hammer-tap');
    frameworkConfig.globalResources('./hammer-press');
    frameworkConfig.globalResources('./hammer-hold');
}
Is there any way to bundle these with the CLI? I tried adding these files to the "resources" element in aurelia.json without success.
The plugin author should export those classes (HammerPressCustomAttribute, ...) so they can be traced properly, but you can dummy-import them yourself as a workaround:
import { HammerPressCustomAttribute } from 'aurelia-hammer/hammer-press';
import { HammerSwipeCustomAttribute } from 'aurelia-hammer/hammer-swipe';
import { HammerTapCustomAttribute } from 'aurelia-hammer/hammer-tap';
Normally you would have to do this as well:
import { HammerHoldCustomAttribute } from 'aurelia-hammer/hammer-hold';
but the class exported from hammer-hold.js is actually named HammerPressCustomAttribute (oops, looks like a copy-paste issue), so just reference the file even with a non-existent class:
import { HammerHoldCustomAttribute } from 'aurelia-hammer/hammer-hold';
This should fix your problem (I hope). It's best to open an issue in the plugin repo and ask the author to export those classes (and rename the duplicated one).
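For context, here is a sketch of where such dummy imports could live, assuming a standard Aurelia CLI src/main.js entry point (the file layout is an assumption):
// src/main.js -- hypothetical entry point
import { HammerPressCustomAttribute } from 'aurelia-hammer/hammer-press'; // dummy imports so the
import { HammerSwipeCustomAttribute } from 'aurelia-hammer/hammer-swipe'; // bundler traces the files
import { HammerTapCustomAttribute } from 'aurelia-hammer/hammer-tap';
import { HammerHoldCustomAttribute } from 'aurelia-hammer/hammer-hold';

export function configure(aurelia) {
    aurelia.use
        .standardConfiguration()
        .plugin('aurelia-hammer'); // register the plugin as usual
    aurelia.start().then(() => aurelia.setRoot());
}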

Using gon with jasmine-rails

Our backbone app uses gon. When we try to run our tests, we get a "gon is undefined" error in the browser console. Our layout file includes a call to include_gon, but that file is not loaded by jasmine, so jasmine fails in the first JavaScript file of ours that references gon. We tried creating a helper that assigns the gon variable to an empty hash (like a fixture), but the helper was called after the first reference to gon and therefore didn't fix our issue.
The secret was to use the asset pipeline to define the load order of my files. I commented out these lines from jasmine.yml
# path to parent directory of src_files
# relative path from Rails.root
# defaults to app/assets/javascripts
#src_dir: "app/assets/javascripts"
# list of file expressions to include as source files
# relative path from src_dir
#src_files:
# - "application.{js.coffee,js,coffee}"
And created spec.js.coffee with these lines:
#= require application
#= require jasmine-jquery
Now my js files get loaded in order and I am good to go.
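If you write the manifest in plain JavaScript (spec.js) instead of CoffeeScript, the equivalent Sprockets directives would, as far as I know, be:
//= require application
//= require jasmine-jquery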
What worked for me was to define window.gon = {} inside the test like so:
describe("A suite is just a function", function() {
beforeEach(function() {
window.gon = {}
gon.test = "this"
})
it ("should find gon", function() {
expect(gon.test).toBe("this")
})
})