Is there a pipeline concept from within ScrapyD? - scrapy

Looking at the documentation for Scrapy and Scrapyd, it appears the only way to write out the result of a scrape is to put that code in the spider's pipeline itself. My colleagues tell me there is an additional way whereby I can intercept the result of the scrape from within Scrapyd!
Has anyone heard of this, and if so, can someone shed some light on it for me please?
Thanks
item exporters
feed exports
scrapyd config

Scrapyd is indeed a service that can be used to schedule crawling processes of your Scrapy application through a JSON API. It also permits integrating Scrapy with different frameworks such as Django; see this guide in case you are interested.
Here is the documentation of Scrapyd.
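For example, scheduling a spider through that JSON API is just an HTTP POST to the schedule.json endpoint. A minimal sketch, assuming Scrapyd is running on its default port 6800 and that a project named my_project with a spider named my_spider has already been deployed:
import requests

# Ask Scrapyd to schedule a crawl via its schedule.json endpoint
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'my_project', 'spider': 'my_spider'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}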
However, if your question is about saving the result of your scraping, the standard way is to do so in the pipelines.py file of your Scrapy application.
An example:
class Pipeline(object):

    def __init__(self):
        # Initialization of your pipeline, e.g. connecting to a database or opening a file
        pass

    def process_item(self, item, spider):
        # Specify here what needs to be done with the scraping result of a single page
        return item
Remember to declare which pipeline you are using in your Scrapy application's settings.py:
ITEM_PIPELINES = {
    'scrapy_application.pipelines.Pipeline': 100,
}
Source: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
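To make this concrete, here is a minimal pipeline in the spirit of the JsonWriterPipeline example from those docs; it appends every scraped item to a JSON lines file (the file name items.jl is just an illustration):
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Called when the spider starts: open the output file
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # Called when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON line and pass the item on
        self.file.write(json.dumps(dict(item)) + '\n')
        return item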

Related

How to collect real-time data from dynamic JS websites using Python Scrapy+Splash?

I am using Scrapy-Splash to scrape real-time data from JavaScript websites, with Splash running in Docker. The spider works completely fine and I'm getting the required data from the website. However, the spider crawls once and finishes the process, so I only get the data for a particular moment. I want to continuously collect the data and store it in a database (e.g. MySQL), since the data is updated every second. The crawl needs to keep running and show the data in real time using plotting libraries (e.g. Matplotlib, Plotly). Is there any way to keep the spider running as Splash renders the updated data (I'm not sure if Splash updates the data like a normal browser)? Here's my code:
import scrapy
import pandas as pd
from scrapy_splash import SplashRequest
from scrapy.http import Request

class MandalaSpider(scrapy.Spider):
    name = 'mandala'

    def start_requests(self):
        link = "website link"
        yield SplashRequest(url=link, callback=self.parse)

    def parse(self, response):
        products = response.css('a.abhead::text').getall()
        nub = []
        for prod in products:
            prod = prod.strip()
            nub.extend(prod.split())
        data = pd.DataFrame({
            'Name': nub[0:len(nub):4],
            'Price': nub[1:len(nub):4],
            'Change': nub[2:len(nub):4],
            'Change_percent': nub[3:len(nub) + 1:4],
        })
        # For single data monitoring (note: list.extend() returns None, so build the list directly)
        sub = [data.iloc[3, 1]]
        yield {'value': sub}
        # Re-request the same page so the crawl keeps going
        yield Request(response.url, callback=self.parse, dont_filter=True)
I am a complete newbie in web scraping, so any additional information is greatly appreciated. I have searched other posts on this site but unfortunately couldn't find the solid information I needed. This type of problem is usually solved using Selenium and BeautifulSoup, but I wanted to use Scrapy.
Thanks in advance.

Scrapy: use both CPU cores in the system

I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it's not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and done everything else I can. Looking at the HTTP requests being sent, it's clear that Scrapy is not downloading 16 sites at all points in time. At some moments it's downloading only 3 to 4 links, even though the queue is not empty.
When I checked the core usage, I found that of the 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that Twisted, the library on top of which Scrapy is built, is single-threaded, and that is why Scrapy only uses a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does cooperative scheduling rather than multiprocessing; that is why your Scrapy crawl runs in just one process. However, you can technically start two spiders in the same process using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
And there is nothing that stops you from having the same class for both the spiders.
The process.crawl method takes *args and **kwargs that are passed on to your spider, so you can parametrize your spiders this way. Say your spider is supposed to crawl 100 pages; you can add start and end parameters to your spider class and do something like below:
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
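On the spider side those keyword arguments land on the spider instance, so a parametrized spider could look roughly like this. A sketch only; YourSpider, the URL pattern and the start/end parameters are illustrative, not part of the original answer:
import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'

    def __init__(self, start=0, end=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each process.crawl() call hands this instance its own page range
        self.start_page = int(start)
        self.end_page = int(end)

    def start_requests(self):
        for page in range(self.start_page, self.end_page + 1):
            yield scrapy.Request(f'http://example.com/page/{page}')

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}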
Note that both crawlers will have their own settings, so if CONCURRENT_REQUESTS is set to 16 for your spider, the two combined will effectively allow 32.
In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted anyway, so I am not sure this would give you a huge advantage over simply setting CONCURRENT_REQUESTS to 32 in a single spider.
PS: Consider reading this page to understand more https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
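For reference, those limits are set in the scrapyd.conf file; a minimal sketch of the relevant section (the values shown are only examples):
[scrapyd]
# 0 means no fixed cap: Scrapyd uses max_proc_per_cpu * number of CPUs instead
max_proc = 0
# on a 2-core machine this allows up to 8 concurrent Scrapy processes
max_proc_per_cpu = 4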

Generate interactive API docs from Tornado web server code

I have a Tornado web server that exposes some endpoints in its API.
I want to be able to document my handlers (endpoints) in code, including description, parameters, examples, response structure, etc., and then generate interactive documentation that lets one "play" with my API, easily make requests and inspect the responses in a sandbox environment.
I know Swagger, and in particular their SwaggerUI solution is one of the best tools for that, but I'm confused about how it works. I understand that I need to feed the SwaggerUI engine some .yaml that defines my API, but how do I generate it from my code?
Many GitHub libraries I found aren't good enough or only support Flask...
Thanks
To my understanding, SwaggerUI just renders a Swagger (OpenAPI) specification.
So it boils down to generating that specification in a clean and elegant manner.
Did you get a chance to look at apispec?
I find it to be an active project with a plugin for Tornado.
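I have not used it with Tornado myself, but a rough sketch with apispec plus the TornadoPlugin from the companion apispec-webframeworks package could look like the following. The handler, route and metadata are made up, and you should check the apispec docs for the exact docstring format it expects:
import tornado.web
from apispec import APISpec
from apispec_webframeworks.tornado import TornadoPlugin

class ItemHandler(tornado.web.RequestHandler):
    def get(self, itemid):
        """Item detail view.
        ---
        get:
            description: Fetch item data by id
            responses:
                200:
                    description: the requested item
        """
        self.write({"id": itemid})

spec = APISpec(
    title="My Tornado API",
    version="1.0.0",
    openapi_version="3.0.2",
    plugins=[TornadoPlugin()],
)

# Register the handler's URL spec; apispec reads the YAML after '---' in the docstring
spec.path(urlspec=(r"/item/(?P<itemid>\d+)", ItemHandler))

# Dump the spec as YAML, which can then be served to Swagger UI
print(spec.to_yaml())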
Here's how we are doing it in our project. We made our own module and we are still actively developing this. For more info: https://pypi.org/project/tornado-swirl/
import tornado.ioloop
import tornado.web
import tornado_swirl as swirl

@swirl.restapi(r'/item/(?P<itemid>\d+)')
class ItemHandler(tornado.web.RequestHandler):
    def get(self, itemid):
        """Get Item data.

        Gets Item data from database.

        Path Parameter:
            itemid (int) -- The item id
        """
        pass

@swirl.schema
class User(object):
    """This is the user class.

    Your usual long description.

    Properties:
        name (string) -- required. Name of user
        age (int) -- Age of user
    """
    pass

def make_app():
    return swirl.Application(swirl.api_routes())

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

Scrapy - download images from image url list

Scrapy has an ImagesPipeline that helps download images. The process is:
Spider: start from a link, parse all the image URLs in the response, and save the image URLs to items.
ImagesPipeline: items['image_urls'] are processed by the ImagesPipeline.
But what if I don't need the spider part and have 100k image URLs ready to be downloaded, for example read from redis? How do I call the ImagesPipeline directly to download the images?
I know I could simply make a Request in a spider and save the response, but I'd like to see if there is a way to use the default ImagesPipeline to save the images directly.
I don't think that the use case you describe is the best fit for Scrapy. Wget would work fine for such a constrained problem.
If you really need to use Scrapy for this, make a dummy request to some URL:
def start_requests(self):
    request = Request('http://example.com')
    # load from redis
    redis_img_urls = ...
    request.meta['redis_img_urls'] = redis_img_urls
    yield request
Then in the parse() method return:
def parse(self, response):
    return {'image_urls': response.meta['redis_img_urls']}
This is ugly but it should work fine...
P.S. I'm not aware of any easy way to bypass the dummy request and inject an Item directly. I'm sure there's one, but it's such an unusual thing to do.
The idea behind a Scrapy pipeline is to process the items the spider generates, as explained here.
Scrapy isn't really about "downloading" stuff; it's a way to create crawlers and spiders. So if you just have a list of URLs to "download", a plain for loop would do.
If you still want to use a Scrapy pipeline, then you'll have to return an item with that list inside the image_urls field.
def start_requests(self):
    yield Request('http://httpbin.org/ip', callback=self.parse)

def parse(self, response):
    ...
    yield {'image_urls': [your list]}
Then enable the pipeline in your settings.
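For reference, enabling the built-in images pipeline in settings.py looks roughly like this (the storage path is just an example):
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Directory (or S3/GCS URI) where the downloaded images will be stored
IMAGES_STORE = '/path/to/images'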

is there a command to make scrapy stop collecting links?

I have a spider that is currently crawling, and I want it to stop collecting new links and just crawl everything it has already collected. Is there a way to do this? I cannot find anything so far.
Scrapy offers different ways to stop the spider (apart from pressing Ctrl+C), which you can find in the CloseSpider extension.
You can put that in your settings.py file, for example:
CLOSESPIDER_TIMEOUT = 20  # stop crawling after 20 seconds
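For completeness, the same extension exposes a few other thresholds, and you can also stop the spider from inside a callback by raising the CloseSpider exception. A short sketch; the threshold values and the 403 check are only examples:
# settings.py -- other CloseSpider thresholds (all optional)
CLOSESPIDER_PAGECOUNT = 500     # stop after crawling 500 responses
CLOSESPIDER_ITEMCOUNT = 1000    # stop after scraping 1000 items
CLOSESPIDER_ERRORCOUNT = 10     # stop after 10 errors

# Or stop programmatically from inside a spider callback:
from scrapy.exceptions import CloseSpider

def parse(self, response):
    if response.status == 403:
        raise CloseSpider('received 403, stopping the crawl')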