Is there a webcrawler that can download an entire site? - dynamic

I need to know if there is a crawler/downloader that can crawl and download an entire website to a link depth of at least 4 pages. The site I am trying to download has JavaScript hyperlinks that are only rendered by a browser, so a crawler cannot follow these hyperlinks unless it renders the pages itself.

I've used Teleport Pro and it works well.

MetaProducts Offline Explorer claims to do exactly what you need.
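If you would rather script it yourself, a crawler that drives a real browser can see the JavaScript-generated links before following them. Below is a rough sketch using Python with Selenium and headless Chrome (my choice of tools, not something the products above use); the start URL, depth limit and output folder are placeholders to adapt.

import os
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

START_URL = "https://example.com/"   # placeholder: the site you want to mirror
MAX_DEPTH = 4                        # link depth asked for in the question
OUT_DIR = "site_dump"                # where rendered pages get saved

options = webdriver.ChromeOptions()
options.add_argument("--headless")   # render pages without opening a window
driver = webdriver.Chrome(options=options)

os.makedirs(OUT_DIR, exist_ok=True)
seen = set()

def crawl(url, depth):
    if depth > MAX_DEPTH or url in seen:
        return
    seen.add(url)
    driver.get(url)                  # the browser executes the JavaScript and builds the links
    # Save the rendered DOM rather than the raw HTML the server sent.
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(os.path.join(OUT_DIR, name + ".html"), "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    # Collect the hrefs after rendering, then recurse into same-site links only.
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    for link in links:
        if link and urlparse(link).netloc == urlparse(START_URL).netloc:
            crawl(urljoin(url, link), depth + 1)

crawl(START_URL, 1)
driver.quit()

The sketch leaves out everything an off-the-shelf downloader gives you (images and CSS, politeness delays, URL normalisation), so treat it as a starting point rather than a replacement for the tools above.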

Related

Page resources could not be loaded by Googlebot

When I check my website URL in the Google URL Inspection tool, it shows that page resources could not be loaded, i.e. image, stylesheet and script files. However, my website works perfectly on the live server, yet the page is not rendered properly by Googlebot smartphone. I have tried everything to remove these errors but nothing has helped. I have also checked that these resources are not blocked in the robots.txt file.
[Screenshot of page resources error]
I've been struggling with this for a couple of days now, and finally reached the only solution that has worked for me. In my case, it wasn't a robots.txt problem, as I believe that you've already checked before posting this.
The problem has to do with the number of resources Googlebot is willing to fetch before giving up. If your CSS and JS files are too many, or too big, Googlebot gives up before fetching all of the resources needed to render the page properly.
You can solve it by minifying your files via a server mod, or via plugins like WP Rocket or Autoptimize. If you have too many CSS and JS files and the problem persists after minifying, try combining these files as well by using the same plugins.
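If you would rather not rely on a plugin, the "combine" step is really just concatenating the stylesheets in the order your pages load them and pointing your templates at the single bundle. A rough Python sketch with placeholder file names follows; the whitespace stripping is only a crude stand-in for what a real minifier (or Autoptimize/WP Rocket) does.

from pathlib import Path

# Placeholder list of stylesheets, in the order the pages currently load them.
CSS_FILES = [
    Path("css/reset.css"),
    Path("css/theme.css"),
    Path("css/custom.css"),
]
BUNDLE = Path("css/bundle.min.css")

parts = []
for css in CSS_FILES:
    text = css.read_text(encoding="utf-8")
    # Crude size reduction: drop blank lines and leading/trailing whitespace.
    parts.append("\n".join(line.strip() for line in text.splitlines() if line.strip()))

BUNDLE.write_text("\n".join(parts), encoding="utf-8")
print(f"wrote {BUNDLE} from {len(CSS_FILES)} files")

The same idea applies to the JS files, although there the concatenation order matters even more.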

How can I automatically download pdf files from a pool of websites as they are uploaded, real-time?

I want to automatically download pdf files from a pool of sites like these:
https://www.wfp.org/publications?f%5B0%5D=topics%3A2234
https://www.unhcr.org/search?comid=4a1d3b346&cid=49aea93a6a&scid=49aea93a39&tags=evaluation%20report
https://www.unicef.org/evaluation/reports#/
I then want to upload them onto my own site.
Can I use Python to build a script for this function? I'd need to scrape the websites periodically so that, as soon as a new file is uploaded, the file is automatically downloaded to my server.
Lastly, assuming I'm sharing these on my own website for non-profit purposes, is this legal?
You can use the Python modules requests and beautifulsoup4 to periodically scrape the websites and download the PDFs, as shown in Download files using requests and BeautifulSoup.
Then you can save them in your server's web path and display them dynamically.
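A minimal sketch of that approach, assuming the PDFs appear as plain links in the page HTML; pages that build their listings with JavaScript (the UNICEF one looks like it might) would need a headless browser instead. The watched pages, download folder and polling interval below are placeholders.

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Placeholders: pages to watch, where to store the files, how often to re-check (seconds).
PAGES = [
    "https://www.wfp.org/publications?f%5B0%5D=topics%3A2234",
]
OUT_DIR = "pdfs"
POLL_SECONDS = 3600

os.makedirs(OUT_DIR, exist_ok=True)

def fetch_new_pdfs(page_url):
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"])
        if not href.lower().endswith(".pdf"):
            continue
        filename = os.path.join(OUT_DIR, href.rsplit("/", 1)[-1])
        if os.path.exists(filename):        # already fetched on an earlier pass
            continue
        with open(filename, "wb") as f:
            f.write(requests.get(href, timeout=60).content)
        print("downloaded", filename)

while True:
    for page in PAGES:
        fetch_new_pdfs(page)
    time.sleep(POLL_SECONDS)  # crude "real-time": poll on a schedule

In practice a cron job running a single pass of fetch_new_pdfs is usually nicer than a sleep loop.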
I'm not a lawyer, but I don't think this is legal. It's like secretly recording a movie in the cinema and then sharing it online, which is definitely not legal.

How to preview a static site?

With Ruby on Rails I can run rails s -p 3000 and preview my site at localhost:3000.
With React I can run npm start and view the site at localhost:8080.
What if I just have HTML and CSS files? How do I preview those?
On OSX, you can run a simple web server from any directory using this command:
python -m SimpleHTTPServer 8000
Then, you can hit the directory in your browser by going to http://localhost:8000/path/to/file.html
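Note that SimpleHTTPServer is the Python 2 name of that module; with Python 3 the equivalent command is

python -m http.server 8000

and it is not OS X specific, it works on any system that has Python installed.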
You can also just double-click index.html to open the file in your browser.
Every time you update the code in your editor (e.g. Sublime Text), you need to reload the browser for the updates to be applied.
This usually depends on your device/OS and what your eventual goals are. You can either use an editor that live-previews the HTML and CSS for you as you type (such as Brackets.io), or you can serve the documents from a local web server such as XAMPP or OS X's built-in simple web server, and check the respective localhost location every time you save changes in your code editor.
You could also simply use online applications like CodePen, which can render HTML and CSS, and even JavaScript. CodePen has recently launched CodePen Projects, which lets you create entire website projects on their site; it is, however, a Pro (paid) feature.
Here's a short overview of code playgrounds that offer the functionality you requested (by no means an exhaustive list):
JSFiddle
CSSDeck
CodePen
JSBin
And of course you can use Stack Snippets here on Stack Overflow, which can also render HTML and CSS (and JS).
If you really only use HTML and CSS, previewing in a browser is also possible: just double-click the .html file to open it in Internet Explorer, Chrome, etc.

Download an image from many pages located on one website

I need to download an image from many pages located on one website.
More specifically:
I have a list of URLs from one website.
The image on every page has the same class name.
So I need to download this image from every page.
How can I automate this? Please help me create a script.
Thanks!
Try Wget: http://en.wikipedia.org/wiki/Wget
You could write a bash script on Linux. Maybe search on Google to find a script that already does this.
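If wget alone is not enough (it cannot pick an image out by its class name), a small Python script using the same requests/BeautifulSoup approach mentioned in the PDF question above would do it. The URL list, class name and output folder below are placeholders you would replace with your own.

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Placeholders: the pages to visit and the class the <img> tag carries on each page.
PAGE_URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
]
IMAGE_CLASS = "main-photo"
OUT_DIR = "images"

os.makedirs(OUT_DIR, exist_ok=True)

for page_url in PAGE_URLS:
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img", class_=IMAGE_CLASS)   # the image shares this class on every page
    if img is None or not img.get("src"):
        print("no matching image on", page_url)
        continue
    img_url = urljoin(page_url, img["src"])      # handles relative src attributes
    filename = os.path.join(OUT_DIR, img_url.rsplit("/", 1)[-1])
    with open(filename, "wb") as f:
        f.write(requests.get(img_url, timeout=60).content)
    print("saved", filename)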

FAST Search for SharePoint crawler issue with DokuWiki pages

My level of frustration is maxing out over crawling DokuWiki sites.
I have a content source in FAST Search for SharePoint that I have set up to crawl a dokuwiki/doku.php site. My crawl rules are set to http://servername/*, match case, and include all items in this path, with "crawl complex URLs" enabled. Testing the content source against the crawl rules shows that it will be crawled. However, the crawl always lasts under two minutes and completes having crawled only the page I pointed it at, and none of the other links on that page. I have checked with the DokuWiki admin and he has robots set to allow; when I look at the source of the pages I see that it says
meta name="robots" content="index,follow"
So, in order to test that the other linked pages were not the problem, I added those links to the content source manually and recrawled. The example source page has three links:
site A
site B
site C.
I added the site A, B and C URLs to the crawl source. The results of this crawl are four successes: the primary source page plus the links A, B and C that I added manually.
So my question is: why won't the crawler follow the links on the page? Is this something I need to change in the crawler on my end, or does it have to do with how namespaces are defined and links are constructed in DokuWiki?
Any help would be appreciated
Eric
Did you disable the delayed indexing options and rel=nofollow options?
The issue was around authentication, even though nothing in the FAST crawl logs suggested that authentication was the problem.
The fix was adding a $freepass setting for the IP address of the search indexing server, so that Apache would not go through the authentication process for each page hit.
Thanks for the reply
Eric