Can I download files with Selenium?

I've managed to locate the correct element on the web page and I can click it. I see the download prompt come up, asking me whether I should open the pdf, open it in Firefox, or save it.
My code looks like this:
my $profile = Selenium::Firefox::Profile->new;
$profile->set_preference(
"browser.download.dir" => "/Users/john/Downloads",
"browser.download.folderList" => "1",
"browser.helperapps.neverAsk.SaveToDisk" => "application/pdf"
);
my $driver = Selenium::Firefox->new(
'binary' => "/usr/local/bin/geckodriver",
'firefox_profile' => $profile
);
[...]
$driver->find_child_element($driver->find_element_by_id('secHead_1'), "./a[\@class='phoDirFont']")->click();
My understanding is that if I've set up the correct preferences, then this file would save without the prompt. That's not happening though. I've dug down into it with dev tools, and it does seem to be serving up the pdf file with "application/pdf" as the mime-type. Firefox certainly recognizes it as one (offering to open it in Firefox, not just with the registered app).
If there is another method (perhaps by sending keystrokes to the prompts), that would be acceptable too. Though I've been using Firefox (trying to get away from Google products in my personal life), if Chrome would make a difference I could switch to that instead.
Looking at my about:config (with the Selenium-opened window), the preferences settings seem to have taken. However, it still prompts for the file.

According to https://www.selenium.dev/documentation/en/worst_practices/file_downloads/, file downloads are not supported in Selenium.
You could try using curl instead, e.g.:
system("curl -s -o output.pdf -L $URL");
where -s stands for silent, -o specifies where to save the file, and -L tells curl to follow redirects.
If you need cookies you can obtain them with:
my @cookies = $driver->get_all_cookies();
Extract whichever cookies you need and then pass them to curl with a --cookie parameter, like so:
system("curl -s -o output.pdf -L --cookie $cookie_one --cookie $another_cookie $URL");
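One way to hand several cookies to curl is to combine them into a single string, since --cookie accepts name=value pairs separated by "; ". The sketch below assumes you have already pulled the name and value of each cookie out of the Selenium result. The helper name join_cookies and the cookie names in the usage comment are made up for illustration.

```shell
# Join any number of name=value pairs into one cookie string
# of the form "a=1; b=2", suitable for curl's --cookie option.
# The body runs in a subshell so its variables do not leak.
join_cookies() (
  out=""
  for pair in "$@"; do
    if [ -z "$out" ]; then
      out="$pair"
    else
      out="$out; $pair"
    fi
  done
  printf '%s\n' "$out"
)

# Hypothetical usage (cookie names and URL are placeholders):
#   curl -s -L -o output.pdf \
#        --cookie "$(join_cookies "session=abc123" "lang=en")" "$URL"
```

Passing one combined string avoids depending on how a given curl version treats a repeated --cookie option.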

Converting docx to PDF/A with LibreOffice Writer

I am happily converting docx files to PDF via the command line (controlled via C# process calls) out of my service.
Unfortunately I could not find any internet search results on how to set the options for the output PDF that the GUI offers me. I am specifically looking for generating PDF/A and tagged PDF via the command line.
Has anyone done this and knows how?
EDIT:
Obviously getting a PDF/A can be done by using unoconv instead.
On Windows one would use the following command line in a checked-out unoconv repository:
python.exe .\unoconv -f pdf -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
I did not find further information on how to select other things (tagged PDF etc.) and where to get a complete list of the options that are available.
EDIT: It seems one can try out the different options in the GUI. The settings get saved to C:\Users\<userName>\AppData\Roaming\LibreOffice\4\user\registrymodifications.xcu. Then one can look up the changed setting and pass it to unoconv like this:
python.exe .\unoconv -f pdf -eUseTaggedPDF=1 -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
Still not sure if I am doing this correctly though.
The Gotenberg project shows how that can be done using unoconv.
$ curl --request POST 'http://localhost:3000/forms/libreoffice/convert' --form 'files=@"doc.docx"' --form 'nativePdfFormat="PDF/A-1a"' -o pdfA.pdf

Can't start linux "screen" with logging to specific output file

I want to enable logging for a screen session when it starts, and have the log saved to a specific file.
What I have so far is:
screen -AmdSL cod2war /home/cod2server/scripts/service_28969.sh
where service_28969.sh is a shell script that calls other scripts which produce output.
I started several of these screen sessions with different names, for example
screen -AmdSL cod2sd /home/cod2server/scripts/service_28962.sh
According to screen's man page, -L enables logging and saves the output in a file called 'screenlog.0'. Since I have several of these screens, only one of them has its output saved in that log file (I can't find any other 'screenlog.*' files in that folder).
I tried the -Logfile "file" option from the same man page, but it doesn't work for me and I can't figure out what I'm doing wrong.
screen -Logfile cod2sd.log -AmdS cod2sd /home/u268450/cod2server/scripts/service_28962.sh
will produce the following error:
Use: screen [-opts] [cmd [args]]
or: screen -r [host.tty]
Options:
[...]
Error: Unknown option Logfile
and
screen -AmdS cod2sd /home/u268450/cod2server/scripts/service_28962.sh -Logfile cod2sd.log
will run without any error and start the screen, but without any logging at all.
You can specify a logfile from within the default startup ~/.screenrc file using a line like
logfile mylog.log
To do this from the command line you can create a file mystartup to hold the above line, then use option -c mystartup to tell screen to read this file for setup instead of the default. If you also need to have ~/.screenrc read, you can add the source command to your startup file. The final result would look something like:
echo 'logfile mylog.log
source ~/.screenrc' >mystartup
screen -AmdSL cod2war -c mystartup /home/cod2server/scripts/service_28969.sh
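Because each session needs its own log file, the startup file can be generated per session. This is a minimal sketch of that idea. The function name write_screenrc and the paths in the usage comment are hypothetical. It only relies on the logfile and source commands shown above.

```shell
# Write a per-session screen startup file.
#   $1: path of the startup file to create
#   $2: path of the log file this session should write to
write_screenrc() {
  {
    printf 'logfile %s\n' "$2"
    # Pull in the user's normal settings only if they exist.
    if [ -f "$HOME/.screenrc" ]; then
      printf 'source %s\n' "$HOME/.screenrc"
    fi
  } > "$1"
}

# Hypothetical usage (session name and paths are placeholders):
#   write_screenrc /tmp/cod2sd.screenrc /tmp/cod2sd.log
#   screen -L -c /tmp/cod2sd.screenrc -dmS cod2sd /path/to/service.sh
```

Generating one startup file per session keeps the sessions from all writing to the same default screenlog.0.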
This works for me:
screen -L -Logfile /Logs/Screen/`date +%Y%m%d`_screen.log
The configs I checked:
screen version 4.08.00 (GNU) 05-Feb-20 on FreeBSD 12.2
and
version 4.06.02 (GNU) 23-Oct-17 on Debian GNU/Linux 10 (buster)
and
version 4.00.03 (FAU) 23-Oct-06 on Mac OS X 10.9.5.
I just ran into this error myself and found this solution that worked with my python file, wanted to share for anyone else who might run into this issue:
screen -L -Logfile LOGFILENAME.LOG -dmS SCREENNAME python3 ./FILENAME.PY
I have no idea if this is the 'correct' way but it works.
-L enables logging
-Logfile LOGFILENAME.LOG declares what to call the log file and file format
-dmS SCREENNAME, dm runs in detached mode and S allows you to name the session
python3 ./FILENAME.PY in this case is my script, but I assume any other script would work here
I tried other orderings of these options and this was the only one that ran without issues. Hope this helps.

wget downloading only PDFs from website

I am trying to download all PDFs from http://www.fayette-pva.com/.
I believe the problem is that when I hover over the link to download the PDF, Chrome shows the URL in the bottom left-hand corner without a .pdf file extension. I saw and used another forum answer similar to this, but there the URL shown when hovering over the PDF link did have the .pdf extension. I have tried the same code that is in the link below but it doesn't pick up the PDF files.
Here is the code I have been testing with:
wget --no-directories -e robots=off -A.pdf -r -l1 \
http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/
I am using this on a single page that I know has a PDF on it.
The complete code should be something like
wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/
Related answer: WGET problem downloading pdfs from website
I am not sure whether downloading the entire website would work, or whether it would take forever. How do I get around this and download only the PDFs?
Yes, the problem is precisely what you stated: The URLs do not contain regular or absolute filenames, but are calls to a script/servlet/... which hands out the actual files.
The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response, which carries the actual filename:
HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close
This option is supported in wget at least since version 1.11.4, which is already 7 years old.
So you would do the following:
wget --no-directories --content-disposition -e robots=off -A.pdf -r \
http://www.fayette-pva.com/
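To see which filename --content-disposition would pick up, it can help to extract it from a response header by hand. The helper below is only a sketch: the name filename_from_disposition is made up, and it handles just the plain filename= form shown above, quoted or unquoted. It does not handle the encoded filename*= variant.

```shell
# Print the filename carried by a Content-Disposition header line.
#   $1: one header line, e.g. as printed by `curl -sI URL`
# Prints nothing if the line is not a Content-Disposition header.
filename_from_disposition() {
  printf '%s\n' "$1" \
    | tr -d '\r' \
    | sed -n 's/^[Cc]ontent-[Dd]isposition:.*filename="\{0,1\}\([^";]*\)"\{0,1\}.*$/\1/p'
}

# Example with the header from the response above:
#   filename_from_disposition \
#     'Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"'
```

If the helper prints a name ending in .pdf for a given link, the -A.pdf filter has a real filename to match once --content-disposition is enabled.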

How to download CSV file with poltergeist using Capybara on phantomjs?

For an integration test, I need to download a CSV file using the poltergeist driver with Capybara. In Selenium (for example with the Firefox/Chrome webdriver), I can specify a download directory and it works fine. But in poltergeist, is there a way to specify the download directory, or any special configuration? Basically I need to know how downloads work with poltergeist, Capybara and PhantomJS.
I can read the server response headers as a Hash in Ruby, but I cannot read the response body to get the file content. Any clue or help, please?
Finally I solved the download part by simply using curl inside Ruby code, without any webdriver. The idea is simple: first I submitted the login form via curl and saved the cookie to a file, then I submitted the CSV export form (also via curl) using the saved cookie, like this
post_data = "p1=d1&p2=d2&p3=d3"
`curl -c cookie.txt -d "userName=USERNAME&password=PASSWORD" LOGIN_SUBMIT_URL`
csv_data = `curl -X POST -b cookie.txt -d '#{post_data}' SUBMIT_URL_FOR_DOWNLOAD_CSV`

Setting up JS debugging with IntelliJ/WebStorm and PhantomJS/Casper

Can I get an interactive JS debugger working on PhantomJS and/or CasperJS?
I didn't solve this entirely, but I definitely reduced the pain.
PhantomJS provides a command line argument to enable webkit's remote debugger. AFAIK, PhantomJS launches a server and dumps the script into the <head> of a webpage with the familiar in-browser debugger. It's actually pretty nice, with breakpoints, etc. However, switching to manually digging around in the terminal for a random command line parameter and the path to your script is seriously irritating.
So, I used IntelliJ's "external tools" feature to launch a Bash script that kills any previous debugging session, launches PhantomJS, and then opens the page up in Chrome.
#!/bin/bash
lsof -i tcp@0.0.0.0:9000 #list anything bound to port 9000
if [ $? -eq 0 ] #if something was listed
then
killall 'phantomjs'
fi
/usr/local/Cellar/phantomjs/2.0.0/bin/phantomjs --remote-debugger-port=9000 $1 &
# --remote-debugger-autorun=yes <- use if you have added 'debugger;' break points
# replace $1 with full path if you don't pass it as a variable.
sleep 2; #give phantomJS time to get started
open -a /Applications/Google\ Chrome.app http://localhost:9000 & #linux has a different 'open' command
# alt URL if you want to skip the page listing
# http://localhost:9000/webkit/inspector/inspector.html?page=1
#see also
#github.com/ariya/phantomjs/wiki/Troubleshooting
The next few lines are settings for IntelliJ, although the above code works just as well on any platform/IDE.
program: $ProjectFileDir$/path/to/bash/script.sh
parameters: $FilePath$
working dir: $ProjectFileDir$
PhantomJS has a remote-debugger-port option you can use to debug your casper script in Chrome dev tools. To use it, simply execute your casper script with this argument:
casperjs test script.js --remote-debugger-port=9000
Then, open up http://localhost:9000 in Chrome and click on the about:blank link that presents itself. You should then find yourself in familiar Chrome dev tools territory.
Since this is a script and not a web page, in order to start debugging, you have to do one of two things before your script will execute:
In the Chrome dev tools page, open the console and execute __run() to actually start your script.
Insert a debugger; line in your code, and run your casper script with an additional --remote-debugger-autorun=yes argument. Doing so with the remote debug page open will run the script until it hits your debugger; line.
There's a great tutorial that explains this all very nicely.