Unable to create screenshot of Craigslist with PhantomJS on Digital Ocean - phantomjs

I have a very strange problem. For the past 1-2 weeks I have been unable to create screenshots of Craigslist pages with PhantomJS on Digital Ocean. Does anybody have any idea what the issue could be and why it no longer works?
It always worked fine before and still works totally fine when I run it locally on my notebook, where it creates the screenshot within 2-4 seconds. However, running the same setup on Digital Ocean (whether in the same Docker container I use locally or installed directly on the host) keeps loading forever, and if I am lucky and wait long enough I get the screenshot after around 7-10+ minutes.
I tried it on different Droplets (existing and brand new) in different zones (SF and Frankfurt) but always hit the same issue. I have already contacted Digital Ocean about it. They were able to reproduce the issue, but according to them nothing changed on their side, so they have no idea what could cause it either. They blame PhantomJS or Craigslist.
It can be reproduced very easily. On a new Droplet (Ubuntu 14.04) the following code will install PhantomJS:
# Install dependencies
sudo apt-get install -y libicu52 libjpeg8 libfontconfig libwebp5
# Install PhantomJS
cd /usr/local/share && \
curl -L -O https://github.com/bprodoehl/phantomjs/releases/download/v2.0.0-20150528/phantomjs-2.0.0-20150528-u1404-x86_64.zip && \
unzip phantomjs-2.0.0-20150528-u1404-x86_64.zip && \
ln -s /usr/local/share/phantomjs-2.0.0-20150528/bin/phantomjs /usr/local/bin/phantomjs
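To confirm the install succeeded, a quick sanity check (this just runs the binary symlinked above):
phantomjs --version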
A very basic example script that creates a screenshot of a product on Craigslist. Save the following as "test-screenshot.js":
var page = require('webpage').create();
var url = 'http://vancouver.craigslist.ca/van/ctd/5148995470.html';
page.open(url, function() {
    page.render('craigslist.png');
    phantom.exit();
});
To run the script: "phantomjs test-screenshot.js".
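Since the symptom is a difference in load time, it helps to time the run on both machines (a simple comparison; locally this should finish in a few seconds):
time phantomjs test-screenshot.js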
Thanks!

It appears that Craigslist is deliberately slowing traffic from Digital Ocean IPs. In the DC area, the CL homepage takes > 55 seconds to load when I use my VPN hosted at Digital Ocean. Without the VPN, load time is normal (< 1000 ms). Other sites work properly through the VPN.
I assume this is a technical response to their recent lawsuits involving scrapers of their site.
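One way to check that the slowdown is network-level rather than a PhantomJS problem is to time a plain HTTP fetch of the same page from the droplet and from your local machine (a quick check, assuming curl is available; a large time-to-first-byte on the droplet only would point at IP-based throttling):
# Compare these timings between the droplet and your local machine
curl -o /dev/null -s -w "connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" http://vancouver.craigslist.ca/van/ctd/5148995470.html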

Related

Python cron job with Chrome not running in AWS EC2

I've been using an EC2 instance to run a python script with cron every day for a month or so. The script uses selenium.
Everything was working correctly until today, when my script did not run.
I have tried to run it manually, but it's not working either. The error message says:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"#ctl00_ctl00_moteurRapideOffre_ctl01_EngineCriteriaCollection_Contract > option:nth-child(5)"}
(Session info: headless chrome=90.0.4430.85)
However, the same script runs fine on my computer (i.e. on my MacBook, not on AWS EC2).
As the problem seems to come from Chrome, I uninstalled it on AWS EC2 using:
sudo yum remove google-chrome-stable
Then I reinstalled it using:
curl https://intoli.com/install-google-chrome.sh | bash
sudo mv /usr/bin/google-chrome-stable /usr/bin/google-chrome
google-chrome --version && which google-chrome
If I try to run Chrome on the EC2 using /usr/bin/google-chrome, it does not work and displays the following error message:
ERROR:browser_main_loop.cc(1386)] Unable to open X display.
I don't know if it was working before as I have never used it this way. But it seems to be a problem.
I have seen on the web that this might be because there is no screen, and that I should use a package named xvfb. I tried to install it with the following command:
sudo yum install xorg-x11-server-Xvfb
I guess the package was correctly installed, but things are not working any better.
To sum up, I think the problem in my python code is linked to Google Chrome not working correctly, and this might be linked to xvfb. But I am not sure at all; it is just what I have tried so far.
Could you please help me? Thanks!
You can set up your cron job like this, so it runs every 30 minutes:
*/30 * * * * export DISPLAY=:0 && <whatever you want to run>
If this does not work and google-chrome or firefox is not found, run the command below in your shell (bash, fish, zsh, etc.) to get your PATH:
echo $PATH
Copy the output of that command and paste it as a PATH line above your cron job, like this:
PATH=<output of echo $PATH>
*/30 * * * * export DISPLAY=:0 && <your selenium script>
You can remove the export DISPLAY=:0 part if you want to run this in the background or make your driver headless.
The reason for doing this: you may have installed the browser from snapd or another separate source, so its location is not on cron's default PATH.
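Putting it together, the crontab could look roughly like this (a sketch; the PATH value, script location, and log file are placeholders to replace with the output of echo $PATH and your actual paths):
# Illustrative crontab: PATH copied from your interactive shell, job runs every 30 minutes
PATH=/usr/local/bin:/usr/bin:/bin
*/30 * * * * export DISPLAY=:0 && python3 /home/ec2-user/myscript.py >> /home/ec2-user/cron.log 2>&1
If your driver is headless, the DISPLAY export can be dropped as noted above.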

How do you make Chromium command line switches on a Chromebook?

I recently saw that ChromeOS added the functionality to do split screen windows in tablet mode in the most recent dev releases. So I put my Chromebook R11 in dev mode for the first time and updated to version 62.
The flag is one of the many on this list https://peter.sh/experiments/chromium-command-line-switches/
The only resource I found for actually executing these switches was http://www.chromium.org/for-testers/command-line-flags
So I tried following the steps. I went to the crosh shell with Ctrl-Alt-T. Then I typed "shell", then "sudo su". Then I tried to modify the config with "sudo vim /etc/chrome_dev.conf", but the file was read-only so it didn't save.
So I visited http://www.chromium.org/chromium-os/poking-around-your-chrome-os-device and followed the steps for making changes to the filesystem and disabling rootfs verification. But the command it told me to enter just gave me an error: "make_dev_ssd.sh: ERROR: IMAGE /dev/mmcblk0 IS NOT MODIFIED."
I'm running out of ideas and resources here...
make_dev_ssd.sh is how you disable rootfs verification and modify files in the rootfs. If that's not working, that might be a bug in that script that should be reported & fixed upstream (e.g. https://crbug.com/new).
That said, are you sure you need to pass a command line flag? Look at chrome://flags and see if the feature you want is available there. Many command line flags are also exposed on that page.
Try this:
sudo su
# /etc on the rootfs is read-only, so copy the config somewhere writable
cp /etc/chrome_dev.conf /usr/local/
# and bind-mount the writable copy over the original
mount --bind /usr/local/chrome_dev.conf /etc/chrome_dev.conf
echo "--arc-availability=officially-supported" >> /etc/chrome_dev.conf

Xvfb, Jenkins, Selenium tests - Capture Screenshots of all pages

I'm trying to find some clues on the following issues and am not able to find good help online.
I'm running Xvfb (X virtual framebuffer) and firefox on a Linux machine in headless mode. The main Xvfb service is up and running and the DISPLAY variable is set:
/usr/bin/Xvfb :99 -ac -screen 0 1600x1200x16
I have some automated selenium-based tests which I run using Gradle (gradle test). They run successfully, and in Jenkins I got this working using the Xvfb plugin. The JUnit post-publish report/result info and Gradle's reports/test/index.html file show a successful test run.
I just run the following to run tests in Gradle:
gradle test -DsomePropConfigFileForEnv=SomeSourceConfigFilewithPathvalue
My questions:
1. How can I capture screenshots of all the pages that this automated test run renders (i.e. the login page, the application main page after login, the pages visited as the user clicks around opening various tabs, links, tables, buttons, etc., and finally the logout page)?
I'm able to get a screenshot from the Xvfb_screen<N> file, which is created under the -fbdir folder (the one we specify when running Xvfb via the Jenkins job), but the screenshot is either a black page if the test runs successfully (possibly due to the 2nd point below), or a valid single-page image if an error is encountered during the test run.
I'm trying to capture all the pages which the automated Selenium tests render (the config file I pass to Gradle as a -D parameter contains the URLs, user name, browser, version, etc.). PS: It's not just some random URL that I'm trying to capture via the Xvfb DISPLAY virtual framebuffer.
During the test, I see there's a valid virtual framebuffer file, with a valid size.
For example, while the Jenkins job is in progress running the Gradle test task, and the Xvfb plugin has started a new Xvfb instance, I see:
/production/JSlaves/kobaloki2_1/xvfb-2015-02-04_01-16-37-6170319257811815857.fbdir/Xvfb_screen0
but as soon as the test is complete (or errors out), this file is deleted from that xxxx.fbdir folder and there's no file at all.
2. Why is this file getting deleted?
If it remained there, I could use the xwd/xwud commands and other tools (imagemagick convert etc.) to create an image file as a POST BUILD action, or even within the BUILD section after the "Invoke Gradle" step.
The following command creates a .png image of the firefox screen (a single page screenshot), assuming Xvfb is running on DISPLAY=:107:
xwd -root -display :107 | convert xwd:- /tmp/capture2.png
and here is the Xvfb process (still running, with a valid Xvfb_screen**** file in its fbdir folder) which was created by the Jenkins job, where the Xvfb plugin is configured with offset base 100 and 7 is the node/build number, making :107 the DISPLAY number:
u10002 30717 19950 1 01:16 ? 00:00:00 Xvfb :107 -screen 0 1024x768x8 -fbdir /production/JSlaves/kobaloki2_1/xvfb-2015-02-04_01-16-37-6170319257811815857.fbdir
I'm not running Xvfb/imagemagick etc. just to get an image of a single URL (e.g. www.google.com); I'm trying to capture all the screenshots that a test renders behind the Xvfb in-memory virtual framebuffer during the test run.
Are there any other tools (simple enough to install without messing up the Linux server) which can achieve the same, i.e. capture screenshots of all the pages that a test renders behind Xvfb/firefox on a headless Linux server?
I also tried the Selenium Grid server, but FF is acting up there (for some reason), so I'm running these tests using Jenkins, Gradle, and the Xvfb plugin on a Linux server (headless mode) with the firefox browser, planning to have N executors running multiple runs of these tests and finally capturing the results per run.
I'm archiving the artifacts (if any) and using the Image Gallery plugin as well, but I don't have images of all the pages which Selenium rendered behind Xvfb/firefox.
Any inputs are greatly appreciated.
Thanks.
If you're running with Selenium then you could use driver.getScreenshotAs()
http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
Set this at the end of a step or method where you want a screenshot and output it to disc.
OK, this is what I did. This approach doesn't require any change to the source code of the project.
Installed imagemagick, i.e. yum install imagemagick on RHEL.
Created a script on the target server, and it works now. In the Jenkins job, once the Xvfb instance has been started (via the Xvfb plugin), just a second before running the Selenium GUI tests via Gradle (or any build tool), I call the following script and pass it the parameters (the DISPLAY variable value is available to the Jenkins job because we use the Xvfb plugin). At the end of the tests the script exits automatically (once xwd stops getting input, it exits gracefully), and finally I publish the images and the .mp4 video on Jenkins (as a sidebar link showing test results/video) and archive the artifacts (.png image files via the "Image Gallery Plugin", and the .mp4 file).
NOTE: This requires that your machine has imagemagick, xwd and ffmpeg installed. If the options passed to any command differ on your OS, tweak them accordingly. The framerate value in the ffmpeg command can be a fraction, i.e. 1/5 or 0.5 or 15 or anything you want (try it and see what you get).
It's up to you whether you want to ARCHIVE this large amount of data or not. You can do it if you have enough space and your Jenkins job has a sensible old-build retention policy.
#!/bin/bash
##
## This script will capture Screenshot (every 0.1 seconds) of an automated GUI (for ex: Selenium tests) tests running behind a HEADLESS Xvfb display instance.
## Then, it'll create a mp4 format movie using the captured screenshots.
##
## Machine where you run this script, should have: Xvfb service running, a session started by Xvfb plugin via Jenkins, xwd,ffmpeg OS commands and imagemagick (utilities).
## - For ex, try this on RHEL to install imagemagick: yum install imagemagick
##
## Variables
ws=$1; ## Workspace folder location
d=$2; d=$(echo $d | tr -d ':'); ## Display number associated with the Xvfb instance started by Xvfb plugin from a Jenkins job
wscapdir=${ws}/capturebrowserss; ## Workspace capture browser's screen shot folder
if [[ -n $3 ]]; then wscapdir=${wscapdir}/$3; fi ## If a user pass a 3rd parameter i.e. a Jenkins BUILD_NUMBER, then create a child directory with that name to archive that specific run.
i=1;
rm -fr ${wscapdir} 2>/dev/null || { echo "- Oh oh.. can't remove ${wscapdir} folder"; echo -e "-- Still exiting gracefully! \n"; exit 0; }; ## a { } group, not a subshell, so the exit actually stops the script
mkdir -p ${wscapdir}
while : ; do
xwd -root -display :$d 2>/dev/null | convert xwd:- ${wscapdir}/capFile_${d}_dispId`printf "%08d" $i`.png 2>/dev/null;
if [[ ${PIPESTATUS[0]} -gt 0 || ${PIPESTATUS[1]} -gt 0 ]]; then echo -e "\n-- xwd or imagemagick convert failed (usually because the Xvfb display has gone away), stopping capture.\n"; break; fi ## break (not exit) so the ffmpeg step below still runs
((i++)); sleep 0.1;
done
ffmpeg -r 5 -i ${wscapdir}/capFile_${d}_dispId%08d.png ${wscapdir}/out_byRateOf5.mp4 2>/dev/null || echo -e "\n-- Some error occurred (maybe too many open files), exiting gracefully!\n"; ## input pattern must match the file names written in the loop above
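For reference, the Jenkins job can kick the capture off in the background just before the Gradle step, roughly like this (a sketch; the script name capture-screenshots.sh is whatever you saved the script as, and $WORKSPACE, $DISPLAY and $BUILD_NUMBER come from Jenkins and the Xvfb plugin):
# Start capturing behind the Xvfb display, then run the tests
./capture-screenshots.sh "$WORKSPACE" "$DISPLAY" "$BUILD_NUMBER" &
gradle test -DsomePropConfigFileForEnv=SomeSourceConfigFilewithPathvalue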

Vagrant stuck in "Waiting for VM to Boot"

I want to preface this question by mentioning that I have indeed looked over most if not all vagrant "Waiting for VM to Boot" troubleshooting threads:
Things I've tried include:
vagrant failed to connect VM
https://superuser.com/questions/342473/vagrant-ssh-fails-with-virtualbox
https://github.com/mitchellh/vagrant/issues/410
http://vagrant.wikia.com/wiki/Usage
http://scotch.io/tutorials/get-vagrant-up-and-running-in-no-time
And more.
Here's how I setup my Vagrant:
Note: We are using Vagrant 1.2.2 since we do not at the moment have time to change configs to newer versions. I am also using VirtualBox 4.2.26.
My office has an official folder which includes things such as the Vagrantfile. Inside my Vagrantfile are these custom settings:
config.vm.box = "my_box"
config.ssh.private_key_path = "~/.ssh/github_rsa"
config.ssh.forward_agent = true
config.ssh.forward_x11 = true
config.ssh.max_tries = 300
config.vm.provision :shell, :inline => "/etc/init.d/networking restart"
I installed our custom box (called package.box) via vagrant box add my_box absolute_path/package.box which went without a hitch.
Running vagrant up, I would look at the "preview" of the VirtualBox, and it would simply be stuck at the login page. My Terminal would also only say: Waiting for VM to boot. This can take a few minutes. As far as I know this is an SSH issue, or a private key issue, though in my Vagrantfile I explicitly pointed to my private key location.
Interesting Notes:
Running dhclient within the VirtualBox GUI says command not found. Running sudo dhclient eth0 was one of the suggested fixes.
This fix: https://superuser.com/a/343775/298915 of "modify the /etc/rc.local file to include the line sh /etc/init.d/networking restart just before exit 0." did nothing to fix the issue.
Conclusion:
Having tried to re-install everything, thinking I had messed up a file, that did not ameliorate the issue either. I am stuck on this issue. Could someone give me some insight?
So after around twelve hours of dejected troubleshooting, I was able to (finally) get the VM to boot.
Set up your private/public keys using the link provided. My box is a Debian Linux 3.2.0-4-amd64, so instead of /root/.ssh/id_rsa.pub you have to use /home/vagrant/.ssh/id_rsa.pub (and the respective id_rsa path for the private key).
Note: make sure your files have the right permissions. Check using ls -l path, and change using chmod. Your machine may not have /home/vagrant/.ssh/authorized_keys, so generate that file with touch /home/vagrant/.ssh/authorized_keys.
Boot your VM with the VirtualBox GUI (through either the Vagrantfile boot-GUI option, or by starting your VM directly in VirtualBox). Log in with vagrant / vagrant when prompted.
Within the GUI, manually start dhclient using sudo dhclient eth0 -v. Why is it off by default? I have no idea. I found out it was off when I tried to wget the private/public keys from the tutorial above and was unable to.
Go back to your local machine's command line and reload vagrant using vagrant reload. It should boot and no longer hang at "Waiting for VM to Boot."
This worked for me. Though it may be different for other machines, for whatever reason Vagrant likes to break.
Suggestion: can this be saved as a script so we don't need to do this manually every time?
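One possible way to script it (untested; it bakes the workaround into the guest so dhclient runs on every boot — run this once from the VirtualBox GUI console after logging in as vagrant):
# Insert 'dhclient eth0' just before 'exit 0' in /etc/rc.local so eth0
# gets a lease on every boot without manual intervention
sudo sed -i 's|^exit 0|dhclient eth0\nexit 0|' /etc/rc.local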
EDIT: Update to the latest version of Vagrant, and you will never see this issue again. About time, huh?

Trying to make a Webkit Kiosk on Debian with Raspberry Pi

I'm trying to build a Webkit Kiosk on a Raspberry Pi.
I found a good start at: https://github.com/pschultz/kiosk-browser
The things I want to do:
1) Start the kiosk without logging in (with inittab?)
Peter Schultz pointed out adding the following line:
1:2345:respawn:/usr/bin/startx -e /usr/bin/browser http://10.0.0.5/zfs/monitor tty1 /dev/tty1 2>&1
But he did not explain the steps to make this work (for noobs).
What I did was add his code to a personal git repository, clone that repo to /usr/bin/kiosk, sudo apt-get install libwebkit-dev, and sudo make.
The line to add to inittab will be:
1:2345:respawn:/usr/bin/startx -e /usr/bin/kiosk/browser http://my-kiosk-domain.com tty1 /dev/tty1 2>&1
If I do this, I end up generating a loop of some kind...
If you want to automatically load a browser full screen in kiosk mode every time you turn on the rpi, you can add one of these two lines to the file /etc/xdg/lxsession/LXDE/autostart:
@chromium --kiosk --incognito www.google.it
@midori -i 120 -e Fullscreen -a www.google.it -p
The first is for chromium and the latter is for midori, the rpi default lightweight browser.
Hint: since we will use the rpi as a kiosk, we want to prevent the screen from going black and disable the screensaver. Edit the autostart file:
sudo pico /etc/xdg/lxsession/LXDE/autostart
find the following line and comment it out using a # (it should be located at the bottom)
#@xscreensaver -no-splash
and append the following lines
@xset s off
@xset -dpms
@xset s noblank
Save, reboot.
More info on
http://pikiosk.tumblr.com/post/38721623944/setup-raspberry-ssh-overclock-sta
The upvoted answer suggests running LXDE for this. You could also do it without such a heavy desktop environment: just start midori or chromium in an X session:
xinit /usr/bin/midori -e Fullscreen -a http://www.examples.com/
xinit chromium --kiosk http://www.examples.com/
Sometimes midori's Fullscreen mode does not work as expected and midori does not use the whole screen. In these cases you can run it inside a very simple window manager like Matchbox to get real fullscreen. Because of xinit, you have to wrap everything in a shell script:
#!/bin/sh
# Start a minimal window manager in the background so midori can go truly fullscreen
matchbox-window-manager &
midori -e Fullscreen -a http://dev.mobilitylab.org/TransitScreen/screen/index/11
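You would then launch the wrapper with xinit (the script path here is an example):
# Make the wrapper executable and start it in a bare X session
chmod +x /home/pi/kiosk.sh
xinit /home/pi/kiosk.sh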
Autostart could be done simply by using /etc/rc.local.
More information concerning screensaver issues and an automated restart could be found here: https://github.com/MobilityLab/TransitScreen/wiki/Raspberry-Pi#running-without-a-desktop
Chromium has a dependency problem on some Debian derivatives for the ARM architecture. For Cubian you find the bug report here. I am not sure if you can install chromium on the latest Raspbian without problems.
But I can really recommend midori. It's very fast and its support for modern web technologies is very good. Like Chromium, it uses WebKit as its rendering engine. If you miss some HTML5/CSS3 features, consider updating libwebkitgtk (for example by using the package from Debian testing).
It's possible you haven't set the DISPLAY environment variable.
Try:
export DISPLAY=:0
/usr/bin/startx /usr/bin/browser
Or, browser can also take a display argument (so you don't need the environment variable):
/usr/bin/startx /usr/bin/browser :0
This works for me on Raspbian from a standard terminal shell (I'm logged in over SSH).
Updated for the current version of Raspbian (with the PIXEL desktop) installed with NOOBS 2.0.
I found you need to edit two different files to get it to work:
/etc/xdg/lxsession/LXDE/autostart
/home/pi/.config/lxsession/LXDE-pi/autostart
So my config file is:
#@xscreensaver -no-splash
@xset s off
@xset -dpms
@xset s noblank
@chromium-browser --kiosk --incognito http://localhost
And that's it.
You should probably start by checking whether /usr/bin/kiosk/browser works at all. Start a normal X session (graphical environment) on your Raspberry Pi, launch a terminal, and try running this command:
/usr/bin/kiosk/browser http://my-kiosk-domain.com
and see what it prints on the terminal. Is this working? Do you see any error messages?
I'm trying to build a Webkit Kiosk on a Raspberry Pi.
I think Instant WebKiosk for Raspberry Pi could be useful for you.
See: http://www.binaryemotions.com/raspberry-digital-signage/