wget not downloading PDFs in directories

Here's my problem: I'm trying to download a directory that contains PDFs. wget downloads the directory structure and some of the PDFs, but it doesn't go deeper than the second directory level to fetch PDFs.
Details (theoretical)
So I have folder1/folder2/folder3(/folder4/folder5)
folder1 contains no PDFs; the directory structure inside it is downloaded.
folder2 contains another folder and some PDFs; the folders are created and the PDFs are downloaded.
folder3 sometimes contains more folders, which are created, but none of the PDFs in it or in its subfolders are downloaded.
Here is what I'm using to try to download all of it:
wget -r -l inf --no-remove-listing -np -c -w 3 --no-check-certificate -R "index.html*" -P "target directory" "https://etc./"
What am I doing wrong?

Solved it: -e robots=off was the solution. Which is odd, because the site actually suggests its own wget command that I disagreed with; I tried it anyway and got even worse results than with my own command. -e robots=off wasn't in their suggested command either, so I figured I didn't need it, but I did.
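For the record, the working command should be the original one with the robots override added (the URL and target directory are the same placeholders as above):
wget -r -l inf -e robots=off --no-remove-listing -np -c -w 3 --no-check-certificate -R "index.html*" -P "target directory" "https://etc./"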

Related

How to convert multiple PDF to Html webpage

I need to convert 800+ PDF files into HTML pages, with each PDF file getting its own page on the site.
I tried to do it with Adobe Acrobat, but what I got was every PDF merged into one big list.
So is there any way to do this automatically?
You could use pdftohtml on Linux and make it loop through all the files in the directory.
You can also find more information about pdftohtml on this thread: How to convert PDF to HTML?
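For example, here is a minimal sketch assuming all the PDFs are in the current directory and pdftohtml (from poppler-utils) is installed; the exact flags and output naming can vary between pdftohtml versions:
# produce one standalone HTML file per PDF in the current directory
for f in *.pdf; do
    pdftohtml -s -noframes "$f" "${f%.pdf}.html"
done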
pdf2htmlEX
Preserves formatting of the PDF file
Only works through Docker (on newer Linux releases the package is not available and the .deb packages won't install)
sudo docker pull bwits/pdf2htmlex
sudo docker run -ti --rm -v /home/user/Documents/pdfToHtml:/pdf bwits/pdf2htmlex pdf2htmlEX --zoom 1.3 file.pdf
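If you need to run that over all 800 files, a loop along these lines should work, assuming the PDFs live in /home/user/Documents/pdfToHtml on the host (the directory mounted as /pdf above):
# run pdf2htmlEX once per PDF in the mounted directory
for f in /home/user/Documents/pdfToHtml/*.pdf; do
    sudo docker run --rm -v /home/user/Documents/pdfToHtml:/pdf bwits/pdf2htmlex \
        pdf2htmlEX --zoom 1.3 "$(basename "$f")"
done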

Is there a way to download multiple PDFs that are linked on a website?

I am trying to download a bunch of PDFs from the Federal Reserve archives, but I have to click on a link and then view each PDF before I can download it. Is there a way to automate this?
Example: https://fraser.stlouisfed.org/title/5170#521653 is a link to speeches; you then have to click the title, then view the PDF, then the actual download button.
All of the remote .pdf files follow the path format:
https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_xxxxxxxx.pdf
where each x is a placeholder for a digit.
So, yes, it's very easy to download a bunch of these PDFs in one go using the command-line in Terminal or whatever shell program you have access to.
If you're on a *nix-based operating system (including macOS), that's good, because your shell probably already has the curl command-line utility installed. Windows may have it too; I'm not sure, since I don't use Windows.
If you're using Windows, you'll have to make some tweaks to the code below, because the folder structures and file naming conventions are different, so the first couple of commands won't work.
But, if you're happy to proceed, open up a Terminal window, and type in this command to create a new directory in your Downloads folder, into which the .pdf files will be downloaded:
mkdir ~/Downloads/FRASER_PDFs; cd ~/Downloads/FRASER_PDFs
Hit Enter. Next, if there's no error, copy and paste this long command and then hit Enter:
curl --url \
"https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_{"$(curl \
https://fraser.stlouisfed.org/title/5170#521653 --silent \
| egrep -io -e '/files/docs/historical/frbatl/speeches/guynn_[0-9]+\.pdf' \
| egrep -o -e '[0-9]+' | paste -s -d, -)"}.pdf" -O --remote-name-all
You can see this uses the URL you supplied in your question, from which that command retrieves all the .pdf links. If you need to do the same with other similar pages, provided they all use the same URL format, you can just substitute 5170#521653 with whatever page reference contains another list of .pdfs.
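If the nested command substitution is hard to follow, the same job can be split into two steps, which also lets you inspect the link list before downloading anything (pdf_links.txt is just a scratch file name):
# step 1: pull the page and save the matching PDF paths, one per line
curl --silent https://fraser.stlouisfed.org/title/5170#521653 \
    | egrep -io -e '/files/docs/historical/frbatl/speeches/guynn_[0-9]+\.pdf' \
    | sort -u > pdf_links.txt
# step 2: download each one, keeping the remote file name
while read -r path; do
    curl -O "https://fraser.stlouisfed.org$path"
done < pdf_links.txt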

Changing the html file for apache

I recently bought a domain and put an HTML and a CSS file onto Apache using Ubuntu (I don't even remember the exact commands). Anyway, now I want to change them. I removed the CSS file by running cd /var/www/html and then sudo rm blabla.css, but I'm not sure about running rm index.html, since I'm not sure what the effects will be. I also had some problems when I tried to move my other CSS and HTML files. How can I accomplish this?
For all the commands mentioned here you can see their help pages ie their manuals on Ubuntu by using
man rm
man cp
man rsync
etc. This command
rm index.html
will remove the file completely, i.e. if you hit your domain
http://www.example.com/
you will likely get an error indicating there is no page, or, depending on how your server is set up, it might list the directory contents. Normally, when editing a personal website, people copy the new files over the top of the old ones, i.e. using something like rsync/FTP.
For instance if you do this
cp foo.html index.html
the cp command will overwrite the index.html file with the contents of foo.html. If you use ftp it will do the same thing but this time if you edit index.html on machine A and ftp it to machine b it effectively does this
cp machineA/index.html machineB/index.html
This allows you to work on one machine and copy the changes to the other.
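For instance, assuming the new versions of your site live in a folder called ~/mysite on the server (a made-up path), a single rsync call copies everything over the old files in the web root:
# copy the contents of ~/mysite over the Apache document root
sudo rsync -av ~/mysite/ /var/www/html/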

excluding a directory from accurev using pop command

I have 10 directories in an AccuRev depot and don't want to populate one of them when using the "accurev pop" command. Is there any way to do this? .acignore doesn't suit my requirements, because I need that folder in another Jenkins build. I just want to save time by avoiding an unnecessary populate of directories.
Any ideas?
Thanks,
Sanjiv
I would create a stream off this stream and exclude the directories you don't want. Then you can pop that stream and get only the directories you want.
When you run the AccuRev populate command you can specify which directories to populate by specifying the directory name:
accurev pop -O -R thisDirectory
will cause the contents of thisDirectory to match the backing stream from the time of the last AccuRev update in that workspace.
The -O is for overwrite and the -R is for recurse. If you have active work in progress, -O will cause that work to be overwritten/destroyed.
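So, assuming the ten top-level directories are named dir1 through dir10 (placeholder names) and dir10 is the one you want to skip, you could simply list the other nine in the workspace:
accurev pop -O -R dir1 dir2 dir3 dir4 dir5 dir6 dir7 dir8 dir9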
The .acignore is only for (external) files and not those that are being managed by AccuRev.
David Howland

Custom `rsync` command to sync my Documents and Dropbox?

This is what I want to achieve:
Dropbox Directory Structure:
Dropbox/
1passwordstuff
Music
documentfolder1
documentfolder2
Documents Structure:
Documents/
documentfolder1
documentfolder2
Then I want to do all of my work within the Documents folder. So let's say I make some changes to a file in documentfolder1; I then want to call a command like rsync ... and have all of my changes pushed into Dropbox. I've managed to achieve this with rsync -r --ignore-existing Documents Dropbox, but there's a problem. Let's say I delete a file such as Documents/documentfolder1/somefile; I then want that file to be deleted from my Dropbox folder as well. I don't know how to do this.
Any help?
Voted to close, since this question isn't programming-related, but I think you want rsync --delete.
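A minimal sketch, assuming Documents and Dropbox both sit in your home directory; syncing just the two document folders keeps --delete away from the unrelated top-level folders (1passwordstuff, Music):
# push the two document folders into Dropbox, removing files there
# that were deleted under Documents
rsync -r --delete ~/Documents/documentfolder1 ~/Documents/documentfolder2 ~/Dropbox/
Add -n (--dry-run) on the first run to preview what would be deleted.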
Why not simply use a symbolic link?
Create a symbolic link in the Dropbox folder to the Documents folder; everything will get synced, and you can still work in your Documents location.
Just go to your Dropbox folder and run
ln -s PATH_TO_DOCUMENTS Documents