Is there a way to download multiple PDF's that are linked on a website? - pdf

I am trying to download a bunch of PDF's from the federal reserve archives but I have to click on a link and then view the PDF before I can download. Is there a way to automate this?
Example: https://fraser.stlouisfed.org/title/5170#521653 is a link to speeches and then you have to click the title, then view pdf, then the actual download button.

All of the remote .pdf files follow the path format:
https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_xxxxxxxx.pdf
where each x is a placeholder for a digit.
So, yes, it's very easy to download a bunch of these PDFs in one go using the command-line in Terminal or whatever shell program you have access to.
If you're in a *nix-based operating system (including MacOS), that's good because your shell probably already has a command utility called curl installed. Windows may have it too, I'm not sure; I don't use Windows.
If you're using Windows, you'll have to make some tweaks to the code below, because the folder structures and file naming conventions are different, so the first couple of commands won't work.
But, if you're happy to proceed, open up a Terminal window, and type in this command to create a new directory in your Downloads folder, into which the .pdf files will be downloaded:
mkdir ~/Downloads/FRASER_PDFs; cd ~/Downloads/FRASER_PDFs
Hit Enter. Next, If there's no error, copy-n-paste this long command and then hit Enter:
curl --url \
"https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_{"$(curl \
https://fraser.stlouisfed.org/title/5170#521653 --silent \
| egrep -io -e '/files/docs/historical/frbatl/speeches/guynn_\d+\.pdf' \
| egrep -o -e '\d+' | tr '\n' ',')"}.pdf" -O --remote-name-all
You can see this uses the URL you supplied in your question, from which that command retrieves all the .pdf links. If you need to do the same with other similar pages, provided they all use the same URL format, you can just substitute 5170#521653 with whatever page reference contains another list of .pdfs.

Related

How to convert multiple PDF to Html webpage

I need to convert 800+pdf files into html webpage, and every pdf file had own page on html webpage.
I tried to make with Adobe Acrobat, but what i get was every pdf merged in one big list.
So is there any way to automatically do this?
You could use pdftohtml on Linux and make it loop through all the files in the directory.
You can also find more information about pdftohtml on this thread: How to convert PDF to HTML?
pdf2htmlEX
Preserves formatting of the PDF file
Only works through docker (On new builds of Linux, this package is not present and deb packages are not installed)
sudo docker pull bwits/pdf2htmlex
sudo docker run -ti --rm -v /home/user/Documents/pdfToHtml:/pdf bwits/pdf2htmlex pdf2htmlEX --zoom 1.3 file.pdf

wget not downloading pdfs in directories

Following problem: I'm trying to download a directory that contains pdfs, and it downloads the file structure, some of the pdfs but doesn't go deeper than the 2nd directory to download pdfs.
Details (theoretical)
So I have folder1/folder2/folder3(/folder4/folder5)
folder1 contains no pdfs, file structure contained in it, is downloaded.
folder 2 contains another folder and some pdfs, folders are created, pdfs are downloaded
folder 3 sometimes contains more folders, which are created but all pdfs contained in it and in the subfolders are not downloaded.
here is what I'm using to try to download all of it:
wget -r -l inf --no-remove-listing -np -c -w 3 --no-check-certificate -R "index.html*" -P "target directory" "https://etc./"
What am I doing wrong?
Solved it: -erobots=off was the solution. Which is weird since the site actually a wget command that I disagreed with, but still tried and had even less of an result than with my own commands, anyway -erobots=off was not mentioned in their orignally code so I figured and I didn't need it, but I did.

How to use gitbash instead of windows cmd.exe with meteor Release 0.7.0.1-win2

I am getting started with Meteorjs. I'm a windows user so I downloaded the windows installer package Release 0.7.0.1-win2. I use gitbash for my command line interface and can't get it to recognize meteor. I get the error "sh.exe": meteor: command not found". It works fine in windows command line but I prefer gitbash.
How do I get meteor to work with gitbash?
I have the perfect answer for you since I literally just solved the issue myself.
First of all make sure meteor works in the default windows command prompt. Next open git bash and check if the following command works:
cmd //c meteor
This runs the command meteor as if you were in the command prompt.
Next step is to set up an alias in git bash so you don't have to type that out each time.
Open git bash and enter the following:
vim ~/.bashrc
this will open/create the bashrc file in VIM, press i to insert and type the following:
alias meteor="cmd //c meteor"
Save and exit vim by first pressing the Esc key then press the ":" key. Now you should be able to enter commands in VIM. Type "wq" and press enter which will write into your .bashrc file and exit vim.
Almost there! Now that you are back in git bash, all you need to do is point to your .bashrc file by entering the following:
source ~/.bashrc
Now you will be able to run meteor commands straight from git bash! Hope that helped!
Here's the fix:
The problem is because of .bat files not being handled properly by
MinGW
Go to this directory - C:\Users[your username]\AppData\Local\.meteor
You should see a meteor.bat file there. Create a new file called "meteor" (without any extension and ""). Open it with notepad and paste the following:
#!/bin/sh
cmd //c "$0.bat" "$#"
save the file and now run git bash. You should be able to use meteor command in git bash.
Details
To run a *.bat command from MinGW's MSYS shell, you must redirect the execution to cmd.exe, thus:
cmd //c foo.bat [args ...]
The foo.bat command file must be in a directory within $PATH, (or you must specify the full path name ... using slashes, not backslashes unless you use two of them for each path name separator). Also, note the double slash to inform cmd.exe that you are using its /C option, (since it doesn't accept the -c form preferred by the MSYS shell.
If you'd like to make the foo.bat file directly executable from the MSYS shell, you may create a two line Bourne shell wrapper script called simply foo alongside it, (in the same directory as foo.bat), thus:
#!/bin/sh
cmd //c "$0.bat" "$#"
(so in your case, you'd create script file meteor alongside meteor.bat).
In fact, since this wrapper script is entirely generic, provided your file system supports hard file links, (as NTFS does for files on one single disk partition), you may create one wrapper script, and link it to as many command file names as you have *.bat files you'd like to invoke in this manner; (hint: use the MSYS ln command to link the files).
Credits to: Keith Marshall on SO and rakibul on Meteor Forums
It shouldn't be too hard - you just need to make sure that the meteor.bat file is in your executable. Check with echo $PATH from the bash console if it is already there.
For me, the meteor 0.7.0.1-win installer appended meteor's folder to the path automatically. However, you can add it manually with:
export PATH=$PATH:/path/to/user/folder/AppData/Local/.meteor
(On CygWin my user folder is at /cygdrive/c/Users/adam - I'm not sure what the equivalent path would be on git bash).
If you like, append that line to your ~/.profile to make sure meteor gets added to the path when the console opens.
Finally, on Windows the executable file is meteor.bat. I made a symbolic link to the filename meteor, just so I wouldn't have to type the .bat:
cd /path/to/user/folder/AppData/Local/.meteor
ln -s meteor.bat meteor.
Please have a look at the issue https://github.com/sdarnell/meteor/issues/18
I would suggest maybe creating a trivial wrapper script or alias that invokes LaunchMeteor.exe with the original arguments.
After more research on google I see that there isn't an implemented way to do this yet. The guys at meteor are working on it and accepting pull requests if you have a solution. The conclusion I came to is to use Vagrant and virtualbox to set up a ubuntu vm for meteor development. You can find info at this site: http://win.meteor.com/ on how to install virtual machines and provision to work with meteor.

excluding a directory from accurev using pop command

I have 10 directories in a AccuRev depot and don't want to populate one directory using "accurev pop" command. Is there any way? .acignore is not suiting to my requirements because in another jenkins build I need that folder. Just want to save time to avoid unnecessary populate of directories.
Any idea?
Thanks,
Sanjiv
I would create a stream off this stream and exclude the directories you dont want. Then you can pop this stream and only get the directories you want.
When you run the AccuRev populate command you can specify which directories to populate by specifying the directory name:
accurev pop -O -R thisDirectory
will cause the contents of thisDirectory to match the backing stream from the time of the last AccuRev update in that workspace.
The -O is for over write and the -R is for recurse. If you have active work in progress the -O will cause that work to be over written/destroyed.
The .acignore is only for (external) files and not those that are being managed by AccuRev.
David Howland

Effortless export from github wiki (?)

I am collecting quite a lot of material in a GitHub wiki. I really like to use the wiki to cooperate with other people and IMHO the platform is really nice, I like it!
So, I would like to keep using the GH wiki to collect stuff, edit, save,etc but I also would like to export the content in order to create a pdf file that we can call "a manual".
I would like to generate an updated version of the manual automatically everytime I want just running a couple of scripts, I can not put too much effort on this.
I guess it is possible to export the content somehow and the use pandoc (http://johnmacfarlane.net/pandoc/) to create the pdf maybe adding an index and a style file.
Another interesting idea could be publish a website once a month dumping content directly from the wiki.
I guess other people already did something like this but I did not find anynthing.
Any idea?
But... the Github wiki of a GitHub repo is a git repo in itself (introduced in August 2010).
You can clone it, push to it or pull from it.
Each wiki is a Git repository, so you're able to push and pull them like anything else.
Each wiki respects the same permissions as the source repository.
Just add ".wiki" to any repository name in the URL, and you're ready to go.
Or, as noted by htafoya in the comments, replace the .git part of the URL (if present) by .wiki.
That makes the "export" part of your question really trivial.
From there, you will find tons of script for converting markdown pages into pdf:
a graddle task
a makefile
a python script
...
I'm adding to this answer, in case it helps any new readers :) here's what I did:
I installed GitHub Desktop: https://desktop.github.com/
Then, on the wiki page in my repository, I clicked "Clone in Desktop"
This saved the wiki locally as a .md file (after following the steps on screen)
I then used http://www.markdowntopdf.com/ to convert it to pdf
(Note: I renamed the files to remove characters that wouldn't work in a pdf file name before uploading to the website)
The end result was really nice.
I found many of the solutions difficult to reproduce/get the right version/understand/fix/etc... So instead, I'll present a patchwork docker solution to effortlessly convert on Windows(using git bash)/MacOS/Linux in 5 "easy" commands
git clone {project_url}.wiki .
# Convert *.md to *.md.html using the actual github pipeline
docker run --rm -e DOCKER_USER_ID=`id -u` -e DOCKER_GROUP_ID=`id -u` \
v "`pwd`:/src" -v "`pwd`:/out" andyneff/github-markdown-preview
# Fix hyperlinks, since wkhtmltopdf is stricter than github servers
docker run --rm -v `pwd`:/src -w /src perl \
perl -p -i -e 's|(.*?)|\1\L\2\E.md.html\L\3\E\4|g'\
*.html
# Lowercase all filename so that hyperlink match
docker run --rm -v `pwd`:/src -w /src python \
python -c 'import sys;import os; [os.rename(f, f.lower()) for f in sys.argv[1:]]' \
*.md.html
#Convert html to pdf using QT webkit
docker run -it --rm -e DOCKER_USER_ID=`id -u` -e DOCKER_GROUP_ID=`id -u`\
-v `pwd`:/work -w /work andyneff/wkhtmltopdf \
wkhtmltopdf --encoding utf-8 --minimum-font-size 14 \
--footer-left "[date]" --footer-right "[page] / [topage]" \
--footer-font-size 10 \
toc \
*.html document.pdf
The perl is the main part that may fail without a better solution. Pandoc has a really good filter solution, but isn't using the github pipeline.
Bugs:
Extra wide code blocks will be rendered with a scroll bar, and essentially cut off in the pdf. It would be best to make the code block not overflow, but you can add --user-style-sheet user.css to the wkhtmltopdf command (before toc/cover), and add to your user.css
.markdown-body .highlight pre,
.markdown-body pre{
overflow:visible !important;
}
Some link in the final pdf are off by +1 page, some are not. Not sure what the pattern is. But anchors with ids (#) do not appear to have this problem
Another option once you clone the wiki, especially if you are already using Atom is to use this Markdown to PDF package.
Worked great for me.
I found really annoying having to convert each markdown document separately (links between markdown documents are lost), so I ended up writting a simple C# program for my own use that does this in a single step: a) Download the last version of the wiki from Github, b) Convert it all the markdown documents merged as one pdf
You can download the binaries (Windows or any platform supporting Mono) from:
https://github.com/borjafdezgauna/CoderDocTools/releases/latest
If, for example, you want to convert to PDF the SimionZoo repository by user simionsoft, you can:
MarkdownToPDF.exe user=simionsoft project=SimionZoo output-file=SimionZoo.pdf
I've accomplished precisely this when creating the portable documentation for Barcode Writer in Pure PostScript:
GitHub Wiki + Makefile + pandoc → PDF
The process is described in this blog post.
This question has already been answered but wanted to add my quick experience here.
I didn't find it necessary to install the Desktop version of Github. You can clone by simply running the following from your commandline:
git clone git#github.com:<username>/<repository>.wiki.git
(Of course, replace username and repository as needed).
The cloned wiki outputted 72 markdown files. As has been previously said, there are numerous ways of converting these files do PDF, you can pick your own tool. However I will say that the easiest solution I encountered was to install Pandoc. I have macOS + homebrew, so a quick brew install pandoc was all I needed.
Some info on using pandoc here: https://stackoverflow.com/a/14908316/3638172
You can also try html_links_to_pdf!
It's a Python 3 script made just to convert a GitHub Wiki to pdf form, using the same styling that GitHub uses, but slightly cleaner.