Scrapyd: No active project - How to schedule spiders with scrapyd - scrapy

I am trying to schedule a Scrapy 2.1.0 spider with the help of Scrapyd 1.2:
curl --insecure http://localhost:6800/schedule.json -d project=bid -d spider=test
This should in theory start the crawl for spider test within project bid. Instead it outputs the error message:
{"node_name": "spider1", "status": "error", "message": "Scrapy 2.1.0 - no active project\n\nUnknown command: list\n\nUse \"scrapy\" to see available commands\n"}
If I cd into the project directory there is the project with several spiders which I can start via "cd /var/spiders/ && scrapy crawl test &".
However, being in another folder also gives me the message "no active project":
/var$ scrapy list
Scrapy 2.1.0 - no active project
Unknown command: list
Use "scrapy" to see available commands
This looks like the exact same info I get from scrapyd, so I suspect that I need to configure somehow the working directory where my projects live.
Scrapyd is running and I can access the console via web "gui".
What is the right approach to start the job via scrapyd?

Before you can launch your spider with scrapyd, you'll have to deploy your spider first. You can do this by:
Using addversion.json (https://scrapyd.readthedocs.io/en/latest/api.html#addversion-json)
Using scrapyd-deploy (https://github.com/scrapy/scrapyd-client)
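As a rough sketch of the deploy-first order described above, the following Python checks listprojects.json before sending the same schedule.json request as the curl call in the question. The address, project and spider names come from the question; the helper names are made up for illustration and are not part of Scrapyd.

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # assumption: the Scrapyd address used in the question


def build_schedule_request(base_url, project, spider):
    """Build the POST request equivalent to the curl call in the question."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    return urllib.request.Request(f"{base_url}/schedule.json", data=data, method="POST")


def deployed_projects(base_url):
    """Ask Scrapyd which projects have been deployed (listprojects.json)."""
    with urllib.request.urlopen(f"{base_url}/listprojects.json") as resp:
        return json.load(resp).get("projects", [])


def schedule_if_deployed(base_url, project, spider):
    """Schedule the spider only when Scrapyd already knows the project."""
    if project not in deployed_projects(base_url):
        raise RuntimeError(f"project {project!r} is not deployed yet; deploy it first")
    with urllib.request.urlopen(build_schedule_request(base_url, project, spider)) as resp:
        return json.load(resp)
```

Calling schedule_if_deployed(SCRAPYD_URL, "bid", "test") before the project is deployed raises the RuntimeError instead of returning the "no active project" response.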

Python cron job with Chrome not running in AWS EC2

I've been using an EC2 instance to run a python script with cron everyday for a month or so. The script uses selenium.
Everything was working correctly until today, when my script did not run.
I have tried to run it manually but it's not working either. The error message says:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"cssselector","selector":"#ctl00_ctl00_moteurRapideOffre_ctl01_EngineCriteriaCollection_Contract > option:nth-child(5)"}
(Session info: headless chrome=90.0.4430.85)
However, the same script runs fine on my computer (i.e. on my MacBook, not on AWS EC2).
As the problem seems to come from Chrome, I uninstalled it on AWS EC2 using:
sudo yum remove google-chrome-stable
Then I reinstalled it using:
curl https://intoli.com/install-google-chrome.sh | bash
sudo mv /usr/bin/google-chrome-stable /usr/bin/google-chrome
google-chrome --version && which google-chrome
If I try to run Chrome on the EC2 instance using /usr/bin/google-chrome, it does not work and displays the following error message:
ERROR:browser_main_loop.cc(1386)] Unable to open X display.
I don't know if it was working before as I have never used it this way. But it seems to be a problem.
I have seen on the web that it might come from the fact that there is no screen and that I should use a package named xvfb. I have tried to install it with the following code:
sudo yum install xorg-x11-server-Xvfb
I guess the package was correctly installed, but it did not help.
To sum up, I think the problem in my Python code is linked to the fact that Google Chrome is not working correctly, and this might be linked to xvfb. But I am not sure at all; it is just what I have tried so far.
Could you please help me? Thanks!
You can simply set up your cron job like this, so that it runs every 30 minutes:
*/30 * * * * export DISPLAY=:0 && <do whatever you want>
If this does not work and google-chrome or firefox is reported as not found, run the command below in your shell (bash, fish, zsh etc.) to get your PATH:
echo $PATH
Copy whatever the above command prints and paste it above your cron job, like this:
PATH=<output of echo $PATH>
*/30 * * * * export DISPLAY=:0 && <your selenium script>
You can remove the export DISPLAY=:0 part if you want to run this in the background or make your driver headless.
The reason for doing this is that you might have installed the browser from snapd or a similar separate source, in which case its location is not on the minimal PATH that cron uses.
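To check whether the PATH really is the culprit, a small Python sketch like the one below can be run once from your shell and once from cron, and the two outputs compared. The candidate binary names and the helper name are assumptions for illustration; adjust them to whatever you installed.

```python
import os
import shutil

# Candidate binary names are assumptions; adjust them to your installation.
BROWSER_NAMES = ("google-chrome", "google-chrome-stable", "chromium-browser", "firefox")


def find_browsers(search_path=None, names=BROWSER_NAMES):
    """Map each candidate name to its resolved location, or None if it is not on the path."""
    return {name: shutil.which(name, path=search_path) for name in names}


if __name__ == "__main__":
    # Print the PATH this process sees, then where each browser resolves.
    print("PATH =", os.environ.get("PATH"))
    for name, location in find_browsers().items():
        print(f"{name}: {location or 'not found'}")
```

If a browser resolves in your shell but shows "not found" under cron, the PATH line above the cron job is what is missing.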

How do I delete/unregister a GitLab runner

I registered a personal GitLab runner several months ago, which I no longer use. How do I completely delete it so that it does not show up on my GitLab CI/CD settings page?
List runners to get their tokens and URLs:
sudo gitlab-runner list
Verify with delete option specifying runner's token and URL:
sudo gitlab-runner verify --delete -t YMsSCHnjGssdmz1JRoxx -u http://git.xxxx.com/
Get your runner token and id
First, go to the GitLab settings page and find the token (e.g. 250cff81 in the image below) and the id (e.g. 354472 in the image below) of the GitLab runner which you wish to delete.
Use the gitlab-runner CLI to unregister the runner
If you have access to the machine which was used to register the GitLab runner, you can unregister the runner using the following command, where you replace {TOKEN} with the token of your GitLab runner (e.g. 250cff81 in the example above).
gitlab-runner unregister --url https://gitlab.org/ --token {TOKEN}
Use the GitLab API to unregister the runner
If you no longer have access to the machine which was used to register the runner, or if the runner is associated with multiple projects, you can use the following Python script. Set RUNNER_ID to the id of your runner (e.g. 354472 in the example above) and GITLAB_AUTH_TOKEN to a GitLab token which you can generate from your profile page.
import requests

GITLAB_AUTH_TOKEN = ...  # GitLab token generated from your profile page
RUNNER_ID = ...  # id of the runner to delete
headers = {"PRIVATE-TOKEN": GITLAB_AUTH_TOKEN}

# Look up the runner to find every project it is associated with.
r = requests.get(f"https://gitlab.com/api/v4/runners/{RUNNER_ID}", headers=headers)
runner_data = r.json()

# Detach the runner from each project first.
for project in runner_data.get("projects", []):
    r = requests.delete(
        f"https://gitlab.com/api/v4/projects/{project['id']}/runners/{RUNNER_ID}",
        headers=headers,
    )
    if not r.ok:
        print("Encountered an error deleting runner from project:", r.json())

# Then delete the runner itself.
r = requests.delete(f"https://gitlab.com/api/v4/runners/{RUNNER_ID}", headers=headers)
if not r.ok:
    print("Encountered an error deleting runner:", r.json())
Here's a one-liner to remove offline runners (for GitLab 14.5):
curl --header "PRIVATE-TOKEN: <private_token>" "https://<your-instance-address>/api/v4/runners/all?scope=offline&per_page=100" | jq '.[].id' | xargs -I runner_id curl --request DELETE --header "PRIVATE-TOKEN: <private_token>" "https://<your-instance-address>/api/v4/runners/runner_id"
You might need to run this more than once if you have more than 100 offline runners (per_page=100).
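If you would rather not re-run the command by hand, the paging can be sketched in Python. This is only an illustration under assumptions: it reuses the endpoint and scope=offline filter from the one-liner, adds the standard page parameter, and all function names are hypothetical.

```python
import json
import urllib.request


def collect_runner_ids(fetch_page, per_page=100):
    """Walk through all pages and return every runner id.

    fetch_page(page, per_page) must return the decoded JSON list for one page.
    """
    ids = []
    page = 1
    while True:
        runners = fetch_page(page, per_page)
        ids.extend(runner["id"] for runner in runners)
        if len(runners) < per_page:  # a short (or empty) page is the last one
            return ids
        page += 1


def make_fetcher(instance_url, private_token):
    """Return a fetch_page callable for the offline-runners listing."""
    def fetch_page(page, per_page):
        url = (f"{instance_url}/api/v4/runners/all"
               f"?scope=offline&per_page={per_page}&page={page}")
        req = urllib.request.Request(url, headers={"PRIVATE-TOKEN": private_token})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch_page


def delete_runner(instance_url, private_token, runner_id):
    """Send DELETE for one runner and return the HTTP status code."""
    req = urllib.request.Request(f"{instance_url}/api/v4/runners/{runner_id}",
                                 headers={"PRIVATE-TOKEN": private_token},
                                 method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Collecting all ids before deleting anything keeps the page boundaries from shifting while runners are being removed.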
If you are talking about the runners listed in "Available group runners: ...", they can be deleted at the runner settings page of your group.
If you no longer have enough information related to a runner, GitLab (UI) will only allow you to disable it.
However, there is a workaround to delete runners via the GitLab UI (if you lost your info).
Create a new blank project within GitLab (called dummy, for instance)
Go to the CI/CD settings page (Settings -> CI/CD -> Runners)
Enable all runners you want to delete to be able to edit them
Lock every runner you wish to delete to the dummy project as shown below
Delete the dummy project
The runners are gone.
The overall idea was to lock all of the orphan runners to a dummy project, then delete that dummy.
PS: If runners are not visible in the dummy project, you may want to unlock them from the project they are associated with, then do the procedure again.
EDIT: This process is most particularly useful when
You do not have access to the machine host (especially in big organisations where rights are segmented), only to your GitLab instance.
You think that creating a runner via the UI should also give you the ability to delete a runner via the UI
You have enough rights but you don't want to fire up a Ruby instance (like described in the GitLab doc) to delete a runner.
With GitLab 15.5 (October 2022), you can also use the Web UI:
Bulk delete runners in the Admin Area
Bulk editing is a powerful and valuable feature when you need to visualize or manage large data sets. For administrators that manage a fleet of runners, the lack of a bulk delete option is a productivity drain and increases the operational overhead of maintaining runners.
Now, in the Admin Area, you can select multiple runners and delete them at the same time. You can also select and delete a full page of runners at once.
See Documentation and Issue.
You must make sure that you copy the value of the token = ... entry from the config.toml file, or from the settings page.
Do not use the registration_token. The registration_token is different from the token.
In my case I had just created a runner, immediately realized that I had misconfigured it (or chosen the wrong executor), and wanted to delete it after first use. I initially used the wrong token.
This happened because I still had the GitLab CI/CD settings page open and in focus, at the "Specific runners // Shared runners" section.
I tried
# BAD - long registration token
gitlab-runner unregister --url https://git.mycompany.de/ \
  --token GR1348941LXUymFTPN5sdKFu1F5mQ
# ERROR: Unregistering runner from GitLab forbidden runner=GR1348941LXUymFTP
# FATAL: Failed to unregister runner
# GOOD - shorter token from config.toml
gitlab-runner unregister --url https://git.mycompany.de/ \
  --token N8Gsyebw_mpYnUBMKB25
# Unregistering runner from GitLab succeeded runner=N8Gsyebw
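A small guard can catch this mix-up before the command is run. The sketch below assumes that registration tokens on your instance carry the GR1348941 prefix seen in the failing example; the function names are hypothetical and not part of gitlab-runner.

```python
REGISTRATION_TOKEN_PREFIX = "GR1348941"  # assumption: prefix seen in the failing example


def is_registration_token(token):
    """Return True when the token looks like a registration token."""
    return token.startswith(REGISTRATION_TOKEN_PREFIX)


def unregister_command(url, token):
    """Build the gitlab-runner unregister argument list, refusing registration tokens."""
    if is_registration_token(token):
        raise ValueError("this looks like a registration token; use the token from config.toml")
    return ["gitlab-runner", "unregister", "--url", url, "--token", token]
```

The returned list can be passed to subprocess.run on the machine that hosts the runner.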
If you've deleted the specific runner in your gitlab server, try to remove the unused runner through config.toml file (locally).
To show all runners:
$ gitlab-runner list
Or
$ cat /Users/yourUser/.gitlab-runner/config.toml
If you try to delete a runner with this command:
$ gitlab-runner verify --delete -t Token-From-Your-Runner -u https://gitlab.com/
-> You'll get an error (Verifying runner... error), because the local entry no longer matches a runner on the remote server.
To solve this, edit config.toml by hand:
Delete every [[runners]] entry by name, together with its indented lines!
If you only have one, the file shows as:
concurrent = 1
check_interval = 0
[session_server]
session_timeout = 1800
[[runners]]

POSTMAN-NEWMAN: 403 Forbidden error message

1) Newman Version (can be found via newman -v): version is 4.6
2) OS details (type, version, and architecture): RHEL 7.4
3) Are you using Newman as a library, or via the CLI? Downloaded Newman via npm
4) Did you encounter this recently, or has this bug always been there: We are executing our postman collections through newman in Jenkins for the first time.
5) Expected behaviour: status 200
6) Command / script used to run Newman: HTTP_PROXY=http://xx.xx.xx.xx:xx HTTPS_PROXY=http://xx.xx.xx.xx:xx newman run collections.json --reporters junit,html,xml
We are trying to automate the execution of API test collections using Newman in Jenkins. The collections are executing properly in postman, but we are getting 403 forbidden when we execute them in Jenkins. We are getting outputs like these:
GET http://xx.xx.xxx.xx:xxxx/api/add-lead/ [403 Forbidden, 2.69KB, 133ms]
There was an error running your collection: tunneling socket couldn't be established, statusCode=403
We have installed Jenkins and the tools needed to execute Postman collections on a Jenkins agent. Jenkins, the agent, and the IPs mentioned in the GET and PUT commands are all in our bank's internal network.
What could be the reason? We have looked through similar issues but found no satisfactory answers.
Please tell me if you need any other details.
Regards

nutch crawl using protocol-selenium with phantomjs launched as a Mesos task : org.openqa.selenium.NoSuchElementException

I am trying to crawl AJAX-based sites with Nutch using protocol-selenium with the phantomjs driver. I am using apache-nutch-1.13 compiled from Nutch's GitHub repository. These crawls are launched as tasks in a system managed by Mesos. When I launch Nutch's crawl script from a terminal on the server, everything works perfectly and the site is crawled as requested. However, when I execute the same crawl script with the same parameters inside a Mesos task, Nutch raises the exception:
fetch of http://XXXXX failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:12215","User-Agent":"Apache-HttpClient/4.3.5 (java 1.5)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a7f98ec0-b8aa-11e6-8b84-232b0d8e1024/element"}}
My first impression was that there was something wrong with the environment variables (HADOOP_HOME, PATH, CLASSPATH...), but I set the same variables in the nutch script as in the terminal and still got the same result.
Any ideas about what I am doing wrong?

Xvfb, Jenkins, Selenium tests - Capture Screenshots of all pages

I'm trying to find some clues on the following issues and have not been able to find good help online.
I'm running Xvfb (X virtual framebuffer) and Firefox on a Linux machine in headless mode. The main Xvfb service is up and running and the DISPLAY variable is set.
/usr/bin/Xvfb :99 -ac -screen 0 1600x1200x16
I have some automated Selenium-based tests which I run using Gradle (gradle test). They run successfully, and in Jenkins I'm able to get this working using the Xvfb plugin. The published JUnit report/result info and Gradle's reports/test/index.html file both show a successful test run.
I just run the following to run tests in Gradle:
gradle test -DsomePropConfigFileForEnv=SomeSourceConfigFilewithPathvalue
My questions:
1. How can I get screenshots of all the pages that this automated test run renders, i.e. the login page, the application main page after login, the pages the user reaches by clicking around the main page (opening/clicking various tabs, links, tables, buttons etc.) and finally the logout page?
I'm able to get a screenshot from the Xvfb_screen<N> file, which is created under the -fbdir folder (the one we specify when running Xvfb via a Jenkins job), but the screenshot is a black page if the test runs successfully (this may be due to the 2nd point below), or it's a valid single-page screenshot (if an error is encountered during the test run).
I'm trying to capture all the pages that the automated Selenium tests render (the config file I pass to Gradle as a -D parameter holds the URLs, user name, browser, version etc.). PS: I'm not just trying to get a screenshot of some random URL using the Xvfb DISPLAY virtual framebuffer.
2. During the test, I see there's a valid virtual framebuffer file, with a valid size.
For example, while the Jenkins job is in progress, running the Gradle test task, and the Xvfb plugin has started a new Xvfb instance, I see:
/production/JSlaves/kobaloki2_1/xvfb-2015-02-04_01-16-37-6170319257811815857.fbdir/Xvfb_screen0
But as soon as the test is complete (or errors out), this file is deleted from the xxxx.fbdir folder and there's no file left at all.
Why is this file getting deleted?
If it remained there, I could use the xwd/xwud commands and other tools (imagemagick convert etc.) to create an image file as a post-build action, or even within the build section after the "Invoke Gradle" step.
The following command creates a .png image file of the Firefox screenshot (a single page only), assuming Xvfb is running on DISPLAY=:107:
xwd -root -display :107 | convert xwd:- /tmp/capture2.png
Here is the corresponding Xvfb process, which is still running and has a valid Xvfb_screen**** file in its -fbdir folder. It was created by the Jenkins job where the Xvfb plugin is configured with offset base 100; 7 is the node/build number, which makes :107 the DISPLAY number:
u10002 30717 19950 1 01:16 ? 00:00:00 Xvfb :107 -screen 0 1024x768x8 -fbdir /production/JSlaves/kobaloki2_1/xvfb-2015-02-04_01-16-37-6170319257811815857.fbdir
I'm not running Xvfb / imagemagick etc. just to get an image of a URL (e.g. www.google.com); I'm trying to get screenshots of everything a test renders in the Xvfb virtual framebuffer during the test run.
Are there any other tools (simple enough to install without messing up the Linux server) which can achieve the same, i.e. capturing screenshots of all the pages that a test renders behind Xvfb/Firefox on a headless Linux server?
I also tried a Selenium Grid server, but Firefox is acting up there (for some reason), so I'm trying to run these tests using Jenkins, Gradle and the Xvfb plugin on a headless Linux server with the Firefox browser. I plan to have N executors to run multiple runs of these tests and finally capture the results per run.
I'm archiving the artifacts (if any) and using the Image Gallery plugin as well, but I don't have images for all the rendered pages that Selenium ran behind Xvfb/Firefox.
Any inputs are greatly appreciated.
Thanks.
If you're running with Selenium then you could use driver.getScreenshotAs()
http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
Call this at the end of each step or method where you want a screenshot, and write the output to disk.
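The same idea can be sketched with Selenium's Python bindings, where the counterpart call is driver.save_screenshot(filename). The ScreenshotRecorder class below is a hypothetical helper, not part of Selenium; it only numbers the files so that the pages stay in the order they were visited.

```python
from pathlib import Path


class ScreenshotRecorder:
    """Save numbered screenshots through a Selenium-style driver."""

    def __init__(self, driver, out_dir):
        self.driver = driver
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.count = 0

    def capture(self, label):
        """Write the current page to <out_dir>/<NNNN>_<label>.png and return the path."""
        self.count += 1
        target = self.out_dir / f"{self.count:04d}_{label}.png"
        self.driver.save_screenshot(str(target))
        return target
```

In a test you would call recorder.capture("login") after the login page loads, recorder.capture("main_page") after logging in, and so on, then archive the output folder in Jenkins.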
OK, this is what I did. This approach doesn't require any change to the source code of the project.
Installed imagemagick, i.e. yum install imagemagick on RHEL.
Created a script on the target server, and it works now. In the Jenkins job, once the Xvfb instance has been started (using the Xvfb plugin in Jenkins) and just before running the Selenium GUI tests via Gradle (or any build tool), I call the following script and pass it the parameters (the DISPLAY variable value is available to the Jenkins job because we are using the Xvfb plugin in it). At the end of the tests, the script exits automatically (the xwd command doesn't get any more input, so it exits gracefully), and finally I publish the images and the .mp4 (video) file on Jenkins (as a sidebar link to show test results / video) and archive the artifacts (.png image files using the "Image Gallery Plugin", plus the .mp4 file).
NOTE: This requires that your machine has imagemagick, xwd and ffmpeg installed. If the options passed to any of the commands differ on your OS, tweak them accordingly. The framerate value in the ffmpeg command can be a fraction, i.e. 1/5 or 0.5 or 15 or anything you want (try it and see what you get).
It's up to you whether you want to archive this large amount of data. You can do it if you have enough disk space and your Jenkins job has a sensible retention policy for cleaning up old builds.
#!/bin/bash
##
## This script captures a screenshot (every 0.1 seconds) of automated GUI tests (e.g. Selenium tests) running behind a HEADLESS Xvfb display instance.
## Then it creates an mp4 movie from the captured screenshots.
##
## The machine where you run this script should have: the Xvfb service running, a session started by the Xvfb plugin via Jenkins, the xwd and ffmpeg OS commands, and imagemagick (utilities).
## - For example, try this on RHEL to install imagemagick: yum install imagemagick
##
## Variables
ws=$1; ## Workspace folder location
d=$2; d=$(echo "$d" | tr -d ':'); ## Display number associated with the Xvfb instance started by the Xvfb plugin from a Jenkins job
wscapdir=${ws}/capturebrowserss; ## Workspace folder for the captured browser screenshots
if [[ -n $3 ]]; then wscapdir=${wscapdir}/$3; fi ## If a 3rd parameter is passed (e.g. a Jenkins BUILD_NUMBER), use a child directory with that name to archive that specific run.
i=1;
## Braces (not a subshell) so that "exit 0" really ends the script
rm -fr "${wscapdir}" 2>/dev/null || { echo "- Oh Oh.. Can't remove ${wscapdir} folder"; echo -e "-- Still exiting gracefully! \n"; exit 0; }
mkdir -p "${wscapdir}"
while : ; do
  xwd -root -display ":$d" 2>/dev/null | convert xwd:- "${wscapdir}/capFile_${d}_dispId$(printf "%08d" "$i").png" 2>/dev/null;
  status=("${PIPESTATUS[@]}"); ## Copy the exit codes right away, before another command overwrites PIPESTATUS
  ## xwd fails once the display is gone (the tests are over), so leave the loop and build the movie below
  if [[ ${status[0]} -gt 0 || ${status[1]} -gt 0 ]]; then echo -e "\n-- xwd or imagemagick convert stopped (check manually if this was unexpected); ending capture.\n"; break; fi
  ((i++)); sleep 0.1;
done
## The input pattern must match the file names written in the loop above
ffmpeg -r 5 -i "${wscapdir}/capFile_${d}_dispId%08d.png" "${wscapdir}/out_byRateOf5.mp4" 2>/dev/null || echo -e "\n-- Some error occurred (maybe too many files opened), exiting gracefully!\n";