How to manage multiple spiders in a scrapy project - scrapy

I am new to Scrapy but have successfully created a fairly sophisticated spider. Now I want to add a few more to the same project. I tried copying my working spider and editing it to work with another target, but I am getting all sorts of global variable errors. I have tried "scrapy crawl my_new_spider", but it seems that all spiders are being initiated. What gives? Should I just add a new class in the existing spider file? That doesn't seem scalable... any pointers would be appreciated. The docs got me pretty far, but I am stumbling now.
Many thanks!

From what I understand of your question, the best way to add more spiders is to add a new class in a new file under the spiders folder, giving each spider its own unique name. With this structure you can share items.py, settings.py, etc. across all spiders in the same project.
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ......
and in spider1.py and spider2.py you can set the names accordingly, like
name = "spider1" and name = "spider2"
so that you can run your spiders as
scrapy crawl spider_name
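For illustration, spider1.py could be as minimal as this (the domain and the parsing logic are placeholders, not from the original question):

# spiders/spider1.py
import scrapy

class Spider1(scrapy.Spider):
    name = "spider1"  # must be unique within the project

    # placeholder target; each spider declares its own start URLs
    start_urls = ["https://example.com"]

    def parse(self, response):
        # each spider keeps its own parsing logic, while sharing the
        # project-wide items.py, pipelines.py and settings.py
        yield {"title": response.css("title::text").get()}

spider2.py looks the same, with name = "spider2" and its own URLs and parse logic, and scrapy crawl spider1 then starts only that one spider.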

Related

How to properly integrate PrismJS into an Eleventy project?

I'm building a site using Eleventy and want to include code examples with syntax highlighting. Prism looks like a great choice for this. How would I properly add it to the build process (not as a CDN include)?
Use Prism's download option. This gives you the JS/CSS files you need. Copy them into your site and make sure you use Eleventy's "Passthrough File Copy" option (https://www.11ty.dev/docs/copy/) to copy the CSS and JS files over to the build output.

What's the easiest way to have "settings profiles" in Scrapy?

Scrapy picks up settings from settings.py (there are default settings, project settings, and per-spider settings as well). What I'm looking for is a way to have more than one settings file and to switch between them quickly when I launch my spiders. If there were some inheritance between files, that would be awesome too.
If you know Spring Boot from the Java world, there is the idea of a profile. You have an application.properties file with your base settings, and then you can have application-dev.properties and application-prod.properties. If you run your application with the option -Dspring.profiles.active=dev, it will pick up application.properties and apply application-dev.properties on top of it. This way you can maintain multiple configurations in parallel and rapidly switch between them.
I've found an approach for Scrapy that requires no supporting code: use SCRAPY_SETTINGS_MODULE and import the base settings file in my dev and prod modules. Are there any other approaches that you use?
The launch line in my case looks like:
export SCRAPY_SETTINGS_MODULE=projectname.profiles.dev && scrapy crawl myspider
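For reference, the dev profile module in that scheme is just a thin layer on top of the base settings (a sketch; projectname and the overridden value are placeholders):

# projectname/profiles/dev.py
from projectname.profiles.base import *  # pull in the shared base settings

# anything re-declared here shadows the value from base.py
LOG_LEVEL = "DEBUG"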
Firstly, if you're only going to change one or two values, then it would be simpler to use a single dynamic settings.py (as mentioned in Gallaecio's answer).
However, if you really need separate settings, there is an even shorter way by defining separate "projects" in scrapy.cfg (docs):
[settings]
default = myproject.settings.dev
dev = myproject.settings.dev
prod = myproject.settings.prod
Then to run a specific one:
SCRAPY_PROJECT=prod scrapy crawl myspider
SCRAPY_PROJECT=dev scrapy crawl myspider
If you don't specify SCRAPY_PROJECT it will use default.
And yes, you can inherit from settings files. Replace your settings.py file with a settings package instead:
myproject/settings/__init__.py
myproject/settings/base.py
myproject/settings/dev.py
myproject/settings/prod.py
In base.py you can have exactly what you have in settings.py. Then at the top of each override file you add:
from .base import *
# Override settings in the same way as if they were declared in settings.py
That wildcard import is usually bad practice, but in this case it's just a plain Python settings file, so the end result is simply that all the base variables are available to override. This is a trick we often use in Django.
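For instance, dev.py might end up looking like this (the specific settings shown are illustrative, not from the answer):

# myproject/settings/dev.py
from .base import *  # everything from base.py is now in scope

# shadow whatever should differ in development
LOG_LEVEL = "DEBUG"
HTTPCACHE_ENABLED = True  # cache responses while iterating locally
CONCURRENT_REQUESTS = 4

prod.py does the same with production values, and scrapy.cfg (above) switches between them via SCRAPY_PROJECT.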
I believe SCRAPY_SETTINGS_MODULE is the best approach.
Alternatively, since a settings module is a Python script, you could change settings dynamically from within settings.py. I've seen this done, for example, to detect automatically whether a spider is running on a local machine or on a Scrapyd server, and to adjust the settings accordingly at run time.
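A sketch of that idea (the hostname convention below is purely illustrative; how you detect the environment will vary):

# settings.py
import socket

BOT_NAME = "myproject"

# hypothetical convention: production Scrapyd hosts are named scrapyd-*
if socket.gethostname().startswith("scrapyd-"):
    LOG_LEVEL = "INFO"
    CONCURRENT_REQUESTS = 32
else:
    LOG_LEVEL = "DEBUG"
    HTTPCACHE_ENABLED = True  # cache aggressively on a dev machine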

ImportError: No module named my project (sys.path is correct)

This is kind of embarrassing because of how simple and common this problem is, but I feel I've checked everything.
My WSGI file is located at: /var/www/igfakes/server.wsgi
Apache is complaining that it can't import my project's module, so I decided to start up a Python shell and see if it's any different - nope.
All the proof is in the following walkthrough of the screenshot:
First, see that I cannot import my project.
Then I import sys and check the path; note /var/www is in the path.
Then I leave Python, check the directory, and confirm my project is in that same directory.
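Roughly, the session looked like this (reconstructed from the description; the module name igfakes is taken from the WSGI path above, and the output is abridged):

$ python
>>> import igfakes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named igfakes
>>> import sys
>>> sys.path
['', ..., '/var/www', ...]
>>> exit()
$ pwd
/var/www
$ ls
igfakes ...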
My project is exactly where I'm specifying. Any idea what's going on?
I've followed a few different tutorials, all with the same instructions, like this one.

Dropwizard serve external images directory

I have a Dropwizard API app and I want one endpoint where I can make the call and also upload an image; these images have to be saved in a directory and then served through the same application context.
Is this possible with Dropwizard? I can only find static asset bundles.
There is a similar question already: Can DropWizard serve assets from outside the jar file?
The module suggested there is mentioned in Dropwizard's third-party modules list. There is also an official modules list. These two lists are hard to find, maybe because the main documentation doesn't reference them.
There is also dropwizard-file-assets, which seems new. I don't know which module will work best for your case; both are based on Dropwizard's AssetServlet.
If you don't like them, you could use them as examples of how to implement your own. I suspect that the resource-caching part may not be appropriate for your use case if someone replaces the same resource name with new content: https://github.com/dirkraft/dropwizard-file-assets/blob/master/src/main/java/com/github/dirkraft/dropwizard/fileassets/FileAssetServlet.java#L129-L141
Edit: Here is a simple project that I've made using dropwizard-configurable-assets-bundle. Follow the instructions in the README.md. I think it does exactly what you want: put some files in a directory somewhere on the file system (outside the project source code) and serve them if they exist.

Combine all my custom JS into one single file with dojo build

I'm having a hard time trying to set up the Dojo build in my project.
Basically, I have my js folder with all my custom widgets and components. I simply want to combine all JavaScript files from the js folder into one single file.
The Dojo sources are located outside this folder. The structure looks similar to this:
/public
    /prod
        /dojo-1.9
            /dijit
            /dojo
            /dojox
    /js
        myScript1.js
        myScript2.js
Do you have any idea how I should configure package.json and profile.js? The documentation doesn't seem to help, since all I am getting is an output folder with the same contents as the js folder (no JavaScript is merged).
You can start by reading this article:
https://dojotoolkit.org/reference-guide/1.10/build/simpleExample.html
It provides a simplified overview of the Dojo build system.
Additionally, there is a Dojo boilerplate with a sample folder structure and a profile.js configuration for a quick start here:
https://github.com/csnover/dojo-boilerplate
I definitely suggest you use the boilerplate as the start for your project, as it simplifies a lot of the initial configuration.