How to run Scrapy spiders in Celery - scrapy

There are several posts about how to set up Scrapy within a Celery task without restarting the Twisted reactor, in order to prevent the twisted.internet.error.ReactorNotRestartable error. I have tried using CrawlerRunner, as recommended in the docs, together with crochet, but to make it work the following lines have to be removed from the code:
d.addBoth(lambda _: reactor.stop())
reactor.run() # Script blocks here until spider finishes.
This is the full code:
from scrapy.crawler import CrawlerRunner

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    from crochet import setup
    setup()  # Fixes the issue with the Twisted reactor.
    runner = CrawlerRunner(my_spider_settings)
    crawler = runner.create_crawler(my_spider_class)
    runner.crawl(crawler)
    d = runner.join()

    def spider_finished(_):  # <-- This function is called when the spider finishes.
        # spider_name, logger, json and DjangoJSONEncoder are defined elsewhere in the module.
        logger.info("{} finished:\n{}".format(
            spider_name,
            json.dumps(
                crawler.stats.spider_stats.get(spider_name, {}),
                cls=DjangoJSONEncoder,
                indent=4
            )
        ))

    d.addBoth(spider_finished)
    return f"{spider_name} started"  # <-- How to block execution until the spider finishes?
It now works, but the spider seems to run detached from the task, so I get the return value before the spider has finished. I have used a workaround with the spider_finished() callback, but it is not ideal because the Celery worker keeps executing other tasks and eventually kills the process, which affects the detached spiders.
Is there a way to block the execution of the task until the Scrapy spider is done?

From the docs:
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Celery does not use a reactor, so you can't use CrawlerRunner.
Run a Scrapy spider in a Celery Task
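Since the question already sets up crochet, one way that is often suggested to make the Celery task block until the crawl completes is crochet's wait_for decorator. The sketch below is untested in this exact setup and reuses the question's names (app, my_spider_class, my_spider_settings); the timeout value is an arbitrary placeholder:

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

@wait_for(timeout=3600)  # blocks the calling thread until the Deferred fires (or times out)
def _crawl(spider_class, spider_settings):
    runner = CrawlerRunner(spider_settings)
    runner.crawl(spider_class)
    return runner.join()  # Deferred that fires when all scheduled crawls have finished

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    setup()  # idempotent; starts the Twisted reactor in a background thread
    _crawl(my_spider_class, my_spider_settings)
    return f"{my_spider_class.name} finished"

Because crochet runs the reactor in its own thread and never stops it, the same worker process can run many crawls without hitting ReactorNotRestartable.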

Related

Selenium and Chromedriver on Lambda - Inconsistency at scale

I have a basic selenium script running/deployed on aws lambda.
Chrome and chromedriver are installed as layers (and available under /opt) via serverless.
The script works ... but only some of the time and rarely at scale (invoking more than 5 instances asynchronously).
I invoke the function in a simple for loop (up to about 200 iterations):
response = client.invoke(
    FunctionName='arn:aws:lambda:us-east-1:12345667:function:selenium-lambda-dev-hello',
    InvocationType='Event',  # 'RequestResponse' | 'Event' (async) | 'DryRun'
    LogType='Tail',
    # ClientContext='string',
    Payload=event_payload,
    # Qualifier='24'
)
On other runs, the process hangs while initiating the Selenium driver on this line:
driver = webdriver.Chrome('/opt/chromedriver_89', chrome_options=options)
On other iterations the driver fails and throws a 'timeout waiting for renderer' exception.
I believe this is often due to a mismatch between chromedriver and Chrome. I have checked and verified that my versions are matched and compatible (and, like I said, they do work sometimes).
I guess I'm looking for some ideas/direction to even begin troubleshooting this. I was under the impression that each invocation of a Lambda function runs in a separate environment, so why would increasing the volume of invocations have any adverse effect on how well my script runs?
Any ideas or thoughts would be greatly appreciated!
Discussed in the comments.
There's no complete solution, but what has helped improve the situation was increasing the memory allocated to the Lambda function.
Alternatives to try/consider:
Don't use Chrome. Use requests and lxml to query the pages at the network level and remove the need for Chrome. I did something similar to support another Stack Overflow question recently. You can see it's similar but not quite the same.
Go to a URL and get some text from an XPath:
from lxml import html
import requests
import json

url = "https://nonfungible.com/market/history"
response = requests.get(url)
page = html.fromstring(response.content)
datastring = page.xpath('//script[@id="__NEXT_DATA__"]/text()')
Don't use Chrome. Chrome is ideal for functional testing - but if that's not your objective, consider using HtmlUnitDriver or PhantomJS. Both are significantly lighter than Chrome and don't require a browser to be installed (you can run them straight from libraries).
You only need to change the driver initialisation and the rest of the script (in theory) should work.
PhantomJS:
$ pip install selenium
$ brew install phantomjs
from selenium import webdriver
driver = webdriver.PhantomJS()
HtmlUnit driver:
from selenium import webdriver

driver = webdriver.Remote(
    desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)

running scrapy CrawlerProcess as async

I would like to run Scrapy along with another asyncio script in the same file, but am unable to.
(I am using the asyncio reactor in the settings:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor")
asyncio.run(playwright_script())
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider)
process.start()
this raises the error:
RuntimeError: There is no current event loop in thread 'MainThread'.
I'm guessing this is because the previous run command blocks the event loop?
I have tried to put the Scrapy commands into a coroutine like so:
async def run_scrapy():
    process = CrawlerProcess(get_project_settings())
    process.crawl(TestSpider)
    process.start()

asyncio.run(run_scrapy())
but receive the error:
RuntimeError: This event loop is already running
How do I properly configure the Scrapy CrawlerProcess to run on my asyncio loop?
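One pattern that is sometimes suggested for this - shown below purely as a sketch, untested against this project, and reusing the question's own TestSpider and playwright_script names - is to let CrawlerProcess own the asyncio loop and to schedule the coroutine on that loop before starting the reactor, instead of running it first with asyncio.run():

import asyncio

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumes TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# is set in the project settings, so Scrapy runs its Twisted reactor on top of asyncio.
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider)

# Schedule the coroutine on the loop the reactor will drive,
# rather than running it to completion first with asyncio.run().
asyncio.get_event_loop().create_task(playwright_script())

process.start()  # blocks; runs the reactor and, with it, the asyncio loop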

I typed `scrapy version` but it triggered/loaded other spiders in the folder

I'm relatively new to Scrapy. I just typed scrapy version as shown below, but it triggered the spiders in the folder.
Obviously, I have some spiders in development; for example, one spider opens a Chrome web driver in its init method, and just typing scrapy version opens the Chrome browser. Why is Scrapy loading all the spiders in the folder? How can I avoid this?
(django_corp_data):~/sherlockit$ scrapy version
['version']
corp_data.spiders.quote
version
Scrapy 1.8.0
I think I made a rookie mistake.
I created the spider using the following command: scrapy genspider quotes quotes.com
The above command creates your spider as class QuotesSpider(scrapy.Spider).
If you just type scrapy version or scrapy crawl, the Scrapy framework will load all of your scrapy.Spider-based spiders.
If you are building multiple spiders, create your spider using the crawl template, e.g. scrapy genspider -t crawl quotes quotes.com, and notice the class of the newly created spider: class QuotesSpider(CrawlSpider).
It took me about a day to figure the above out; hope this helps someone.
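Independently of which template you use, it may also be worth checking - an assumption here, since the spider code isn't shown in the question - that the webdriver is not created at module or class level. The Scrapy CLI imports every module listed in SPIDER_MODULES to build its spider registry, so any import-time side effect runs even for scrapy version. A minimal sketch of deferring the driver creation:

import scrapy
from selenium import webdriver

class QuoteSpider(scrapy.Spider):
    name = "quote"

    # driver = webdriver.Chrome()          # class level: executes when the module is imported
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()   # instance level: executes only when the spider actually runs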

Can we just run the chromedriver once with Selenium while repeating automation?

I use Selenium to perform automation.
However, if I repeat the automation, I end up with a lot of chromedriver processes.
I want to solve this problem.
Is it possible to run just one chromedriver?
The reason you're left with the *driver.exe processes is most probably that you are not explicitly closing them at the end of your test run - calling the quit() method on the driver object in your chosen language.
That step is usually done in the object's destructor if you're using an object-oriented approach, in a finally block if it's exception handling, or in the program's/script's exit lines. Most high-level frameworks (Cucumber, TestNG, Robot Framework, the plethora of unit testing frameworks in different languages) have some kind of "tear-down" block which is usually used for this purpose.
Why is this happening?
When you start your automation, the OS starts a process for it; when you instantiate a webdriver object, it spawns a process for the browser's driver - the "chromedriver.exe" in your case. The next step is to open the browser instance - the "chrome.exe".
When your run finishes, its process is closed. But if you have not explicitly called the quit() method, the browser's driver persists, "stays alive", and is now an orphaned process (not to be mistaken with a zombie one, which is a totally different thing) - fully functional, yet with no program to command it.
In fact, at this stage - having a working driver process and a browser - you can reconnect to it and use it in future runs. Check out how and why here: https://stackoverflow.com/a/52003231/3446126
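As a minimal illustration of the tear-down point above (the URL and the bare driver setup are placeholders, not taken from the question), closing the driver in a finally block ensures the chromedriver process exits even when a test step throws:

from selenium import webdriver

driver = webdriver.Chrome()  # spawns a chromedriver process plus a Chrome instance
try:
    driver.get("https://example.com")  # placeholder test steps
finally:
    driver.quit()  # shuts down both the browser and the chromedriver process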
As per the screenshot you have provided, there seem to be a couple of zombie ChromeDriver processes on your system.
Answering straight: you can't work with just one ChromeDriver process across repeated automation runs, as you cannot reconnect to the previous browsing session. You can find a detailed discussion in How can I reconnect to the browser opened by webdriver with selenium?
Your code trials would have given us more insight into why the ChromeDriver processes are not getting cleaned up. As per best practice, always invoke driver.quit() within the tearDown(){} method to close and destroy the WebDriver and Web Client instances gracefully, as follows:
driver.quit()   # Python
driver.quit();  // Java
driver.Quit();  // DotNet
You can find a detailed discussion in PhantomJS web driver stays in memory
In case the ChromeDriver processes are still not destroyed and removed, you may need to kill the processes from the task list. You can find a detailed discussion in Selenium : How to stop geckodriver process impacting PC memory, without calling driver.quit()?
Python solution (cross-platform):
import os
import psutil

PROCNAME = "geckodriver"  # or "chromedriver" or "IEDriverServer"

for proc in psutil.process_iter():
    # check whether the process name matches
    if proc.name() == PROCNAME:
        proc.kill()
I use a cleanup routine at the end of my tests to get rid of zombie processes, temp files, etc. (Windows, Java)
import org.apache.commons.io.FileUtils;
import java.util.logging.Logger;
import java.io.File;
import java.io.IOException;

public class TestCleanup {

    protected static Logger logger = Logger.getLogger(TestCleanup.class.getName());

    public void removeTemporaryFiles() {
        try {
            // getPropertyValue() is a helper provided elsewhere in the test framework
            FileUtils.cleanDirectory(new File(getPropertyValue("java.io.tmpdir")));
        } catch (IOException ioException) {
            logger.info("IOException trying to clean the downloads directory: " + ioException.getMessage());
        }
    }

    public void closeWebdriver() {
        // webDriver is the active WebDriver instance held by the surrounding test class
        webDriver.close();
    }

    public void killOperatingSystemSessions() {
        try {
            Process process = Runtime.getRuntime().exec("taskkill /IM \"chromedriver.exe\" /F");
        } catch (IOException ioException) {
            logger.info("Exception when shutting down chromedriver process: " + ioException.getMessage());
        }
    }
}

Run a Scrapy spider in a Celery Task

This is not working anymore; Scrapy's API has changed.
Now the documentation features a way to "Run Scrapy from a script", but I get the ReactorNotRestartable error.
My task:
from celery import Task
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from .spiders import MySpider


class MyTask(Task):

    def run(self, *args, **kwargs):
        spider = MySpider
        settings = get_project_settings()
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        log.start()
        reactor.run()
The Twisted reactor cannot be restarted. A workaround for this is to let the Celery task fork a new child process for each crawl you want to execute, as proposed in the following post:
Running Scrapy spiders in a Celery task
This gets around the "reactor cannot be restarted" issue by utilizing the multiprocessing package. But the problem is that the workaround is now obsolete with the latest Celery version, because you instead run into another issue where a daemon process can't spawn subprocesses. So in order for the workaround to work, you need to go down in Celery version.
Yes, the Scrapy API has changed, but with minor modifications (importing Crawler instead of CrawlerProcess) you can get the workaround to work by going down in Celery version.
The Celery issue can be found here:
Celery Issue #1709
Here is my updated crawl script, which works with newer Celery versions by utilizing billiard instead of multiprocessing:
from scrapy.crawler import Crawler
from scrapy.conf import settings
from myspider import MySpider
from scrapy import log, project
from twisted.internet import reactor
from billiard import Process
from scrapy.utils.project import get_project_settings
from scrapy import signals


class UrlCrawlerScript(Process):

    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)
        self.crawler.configure()
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()


def run_spider(url):
    spider = MySpider(url)
    crawler = UrlCrawlerScript(spider)
    crawler.start()
    crawler.join()
Edit: Reading Celery issue #1709, they suggest using billiard instead of multiprocessing in order for the subprocess limitation to be lifted. In other words, we should try billiard and see if it works!
Edit 2: Yes, by using billiard my script works with the latest Celery build! See my updated script above.
The Twisted reactor cannot be restarted, so once one spider has finished running and the crawler has stopped the reactor implicitly, that worker is useless.
As posted in the answers to that other question, all you need to do is kill the worker which ran your spider and replace it with a fresh one; this prevents the reactor from being started and stopped more than once in the same process. To do this, just set:
CELERYD_MAX_TASKS_PER_CHILD = 1
The downside is that you're not really using the Twisted reactor to its full potential and you waste resources running multiple reactors, since one reactor can run multiple spiders at once in a single process. A better approach is to run one reactor per worker (or even one reactor globally) and not let the crawler touch it.
I'm working on this for a very similar project, so I'll update this post if I make any progress.
To avoid the ReactorNotRestartable error when running Scrapy in a Celery task queue, I've used threads. It is the same approach used to run the Twisted reactor several times in one app. Scrapy also uses Twisted, so we can do the same.
Here is the code:
from threading import Thread
from scrapy.crawler import CrawlerProcess
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'


class MyCrawler:

    spider_settings = {}

    def run_crawler(self):
        process = CrawlerProcess(self.spider_settings)
        process.crawl(MySpider)
        Thread(target=process.start).start()
Don't forget to increase CELERYD_CONCURRENCY for Celery:
CELERYD_CONCURRENCY = 10
This works fine for me.
This does not block the task while the crawl is running, but in any case the Scrapy best practice is to process data in callbacks. Just do it this way:
for crawler in process.crawlers:
    crawler.spider.save_result_callback = some_callback
    crawler.spider.save_result_callback_params = some_callback_params

Thread(target=process.start).start()
I would say this approach is very inefficient if you have a lot of tasks to process, because Celery is threaded and runs every task within its own thread.
Let's say that with RabbitMQ as a broker you can pass >10K q/s.
With Celery this could potentially cause 10K threads of overhead!
I would advise not using Celery here. Instead, access the broker directly!
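For illustration only, here is a rough sketch of what "access the broker directly" could look like with RabbitMQ; the queue name, message format, and the choice to run each crawl in its own subprocess are all assumptions, not something prescribed above:

import json
import subprocess

import pika  # plain AMQP client, talking to RabbitMQ without Celery

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="crawl_requests", durable=True)

def handle(ch, method, properties, body):
    payload = json.loads(body)
    # Run each crawl in a fresh process so every crawl gets a fresh Twisted reactor.
    subprocess.run(["scrapy", "crawl", payload["spider"]])
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="crawl_requests", on_message_callback=handle)
channel.start_consuming()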