I have a project with obout 30 spiders, all scheduled via cron job. Whenever I want to deploy a project I git push to production where a hook will put the files in place.
Now I came accross scrapyd which seems to do both in a more soffisticated way by egifying the scraper and deploying it to the production environment. Looking at the code it seems that this project has come to a halt about 3 years ago. I am wondering if there is an advantage to switch to scrapyd and what the reason is for this code to be so old and no longer under development. Scrapy itself receives regula updates in contrast.
Would you advice to use scrapyd and if yes, why?
I've been using scrapyd for about 2 years, and I do prefer to use it over just launching your jobs using scrapy crawl:
You can set the number of scrapers that can run at the same time using `max_proc_per_cpu. Any scrapers you launch when the max is reached, are put in a queue and launched when a spot is available.
You have a minimalistic GUI in which you can check the queues & read the logs.
Scheduling spiders is easily done with the api-calls. Same for listing the spiders, cancelling spiders, ...
You can use http cache even when running multiple spiders at the same time
You can deploy on multiple servers at once if you want to spread out your crawls over different servers
Related
So we are doing distributed testing of our web-app using JMeter. For that you need to have the jmeter-server.bat file running in background as it acts as sort of a listener. The problem arises when one of the slave machine out of 4 restarts due to the load and the test is effectively stuck right there as the master machine expects some output from the 4th machine. Currently the automation is done via ansible-playbooks which are called in Jenkins. There are more or less 15 tests that are downstream to one another. So even if one test is stuck, the time is wasted until someone check on the machines.
Things I've tried so far:
I've tried using the Windows Task Scheduler and kept the jmeter-server.bat to run without any user loggin in, but it starts the bat file in background which in-turn spawns all the child processes in the background as well i.e. starts Selenium Chrome in headless mode.
I've tried adding the jmeter-server.bat in startup and configuring the system to AutoLogon without any password to trigger a session which will call the startup file. But unfortunately the idea was scrapped by IT for being insecure.
Tried using the ansible playbook by using the win_command but it again gets stuck as the batch file never returns anything.
Created a service as well for the bat file, but again the child processes started in background.
The problem arises when one of the slave machine out of 4 restarts due to the load
Instead of trying to work around the issue I would rather recommend finding the root cause and fixing it.
Make sure to follow JMeter Best Practices
Configure Java to take heap dump on failure
Inspect Windows PerfMon and operating system/application logs
Check presence of .hprof files in the "bin" folder of your JMeter installation and see what do they say
In general using Selenium for conducting the load is not recommended, I would rather suggest using JMeter's HTTP Request samplers for that, given you properly configure JMeter to behave like a real browser from the system under test perspective there won't be any difference whether the load comes from HTTP Request samplers or from the real browser.
The same states documentation on the WebDriver Sampler
Note: It is NOT the intention of this project to replace the HTTP Samplers included in JMeter. Rather it is meant to compliment them by measuring the end user load time.
I'm looking to automate how my bots are updated. They're hosted on a VPS from GalaxyGate and kept alive using PM2. I use VSCode to develop them, then I push to GitHub and pull the code from there to the VPS to update them directly in production. However, this currently has to be done manually and I always forget the git commands that I'm supposed to run.
I'm looking to have this process automated; ideally the server would periodically run some sort of script to update all bots that are hosted on it. It would fetch GitHub and apply the new changes (if any) and properly restart the bots, so I have a few questions:
Is this possible, and if so, how would I go about doing that?
Are there any risks/downsides with such a system?
Any help would be appreciated
so I am currently working in heroku and I need the task to programmatically restart if certain errors arise, so I used the heroku restart API to do so. The problem is that my program is restarting every 10 seconds or so, and it seems that heroku is letting the other web tasks stay online which leads to them later on crashing. I was wondering if there was anyway to only run one task at a time and skip their 60/30 seconds waiting period for when you restart a task. I read somewhere about One - off tasks, but I wasn't sure. I am sorry for the formatting of my question, it is my first post on StackOverflow. Thank you.
I’ve been digging a bit into the way people run integration and e2e tests in the context of Kubernetes and have been quite disappointed by the lack of documentation and feedbacks. I know there are amazing tools such as kind or minikube that allow to run resources locally. But in the context of a CI, and with a bunch of services, it does not seem to be a good fit, for obvious resources reasons. I think there are great opportunities with running tests for:
Validating manifests or helm charts
Validating the well behaving of a component as part of a bigger whole
Validating the global behaviour of a product
The point here is not really about the testing framework but more about the environment on top of which the tests could be run.
Do you share my thought? Have you ever experienced running such kind of tests? Do you have any feedbacks or insights about it?
Thanks a lot
Interesting question and something that I have worked on over the last couple of months for my current employer. Essentially we ship a product as docker images with manifests. When writing e2e tests I want to run the product as close to the customer environment as possible.
Essentially to solve this we have built scripts that interact with our standard cloud provider (GCloud) to create a cluster, deploy the product and then run the tests against it.
For the major cloud providers this is not a difficult tasks but can be time consuming. There are a couple of things that we have learnt the hard way to keep in mind while developing the tests.
Concurrency, this may sound obvious but do think about the number of concurrent builds your CI can run.
Latency from the cloud, don't assume that you will get an instant response to every command that you run in the cloud. Also think about the timeouts. If you bring up a product with lots of pods and services what is the acceptable start up time?
Errors causing build failures, this is an interesting one. We have seen errors in the build due to network errors when communicating with our test deployment. These are nearly always transitive. It is best to avoid these making the build fail.
One thing to look at is GitLab are providing some documentation on how to build and test images in their CI pipeline.
On my side I use travis-ci. I build my container image inside it, then run k8s with kind (https://kind.sigs.k8s.io/) inside travis-CI, and then launch my e2e tests.
Here is some additional information on this blog post: https://k8s-school.fr/resources/en/blog/k8s-ci/
And the scripts to install kind inside travis-ci in 2 lines: https://github.com/k8s-school/kind-travis-ci.git. It allows lots of customization on the k8s side (enable psp, change CNI plugin)
Here is an example: https://github.com/lsst/qserv-operator
Or I use Github Actions CI, which allows to install kind easily: https://github.com/helm/kind-action and provide plenty of features, and free worker nodes for open-source projects.
Here is an example: https://github.com/xrootd/xrootd-k8s-operator
Please note that Github action workers may not scale for large build/e2e tests. Travis-CI scales pretty well.
In my understanding, this workflow coud be moved to an on-premise gitlab CI where your application can interact with other services located inside your network.
One interesting thing is that you do not have to maitain a k8s cluster for your CI, kind will do it for you!
The scrapy doc says that:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
is there some advantages in comformance use scrapyd?
Scrapyd allows you to run scrapy on a different machine than the one you are using via a handy web API which means you can just use curl or even a web browser to upload new project versions and run them. Otherwise if you wanted to run Scrapy in the cloud somewhere you would have to scp copy the new spider code and then login with ssh and spawn your scrapy crawl myspider.
Scrapyd will also manage processes for you if you want to run many spiders in parallel; but if you have Scrapy on your local machine and have access to the command-line or a way to run spiders and just want to run one spider at a time, then you're better off running the spider manually.
If you are developing spiders then for sure you don't want to use scrapyd for quick compile/test iterations as it just adds a layer of complexity.