So I'm trying to download the entire domain of a private wiki. I've saved the cookies in a cookies.txt file and am using it with wget for authentication, like so:
wget --load-cookies=cookies.txt --recursive --no-parent --convert-links --backup-converted --adjust-extension --limit-rate=500k https://wiki-to-download
It proceeds to download the entire wiki domain. At first glance it seemed to have worked: I opened the main page's HTML file locally in my browser, but almost all of the links besides the home page lead to the same thing: the login page...
I'm guessing it authenticated me once, allowing the download of the home page, but then didn't keep my credentials as it retrieved the rest of the pages, so it downloaded the dreaded "login required" page for each of them. How can I avoid this? In other words, how can I make sure every file gets downloaded correctly, as if I were logged in the whole time?
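One detail worth checking here: --load-cookies only picks up cookies already present in cookies.txt, and browser exports often omit session cookies. A sketch of an alternative that lets wget perform the login itself and keep session cookies for the recursive run (the login URL and form field names are placeholders, not the wiki's real ones):

# Log in with wget and save all cookies, including session cookies (URL and form fields are hypothetical)
wget --save-cookies=cookies.txt --keep-session-cookies --post-data='username=me&password=secret' https://wiki-to-download/login

# Crawl using those cookies, keeping session cookies alive for the whole run
wget --load-cookies=cookies.txt --keep-session-cookies --recursive --no-parent --convert-links --backup-converted --adjust-extension --limit-rate=500k https://wiki-to-download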
I'm trying to add a test coverage badge to the Readme of a private repository on GitHub. Our continuous integration process saves out the image to a secured Google Cloud Storage bucket that's not accessible to the public, and should remain that way.
Google's authorization layer is smart enough that if I go to the URL for the image, I'm automatically redirected to the resource with a valid auto-generated signed URL.
E.g., if I go to http://storage.cloud.google.com/secret-files/mysecretfile.png, then if I'm logged in and allowed to view it, I'm automatically redirected to something like https://blahblah-apidata.googleusercontent.com/download/storage/v1/b/secret-files/o/mysecretfile.png?key=verylongkey, where I can load the image.
This seemed perfect. Reference the canonical path in the GitHub Readme, authenticated users see the image, unauthenticated users are still blocked, we don't have to make the file public, and we don't have to do anything complicated.
Except that GitHub is proxying the image request, meaning that it will always be unauthenticated. My browser is loading something like https://camo.githubusercontent.com/mysecretimage.png.
Is there a clever way to work around this? Or do I need to go back to the drawing board?
All images on github.com are proxied through the Camo image proxy. There are a couple of reasons for this:
It preserves the privacy of users. A document can't track users by directing them to a different site or by setting cookies.
It means images can be cached and served at an appropriate size.
GitHub can have a very strict content security policy that does not allow loading from untrusted sites, which means that any sort of accidental security problem (like an XSS) is a lot less likely to work.
Note the last part. Even if you found some sneaky way to get another image URL to render properly in the website, your browser wouldn't load it because it violates the Content-Security-Policy header the site sent, and moreover, your browser would tattle about that to the reporting URL that GitHub provided.
So any image URL you provide will need to be readable by GitHub's image proxy and it won't be possible to serve different content to different users.
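To make that concrete, a Content-Security-Policy of roughly the following shape (illustrative only, not GitHub's actual header; the report endpoint is a placeholder) is what stops the browser from loading images that don't come through the proxy and tells it where to report violations:

Content-Security-Policy: default-src 'none'; img-src 'self' https://camo.githubusercontent.com; report-uri /csp-report-endpoint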
On a website I display links to PDF files.
When a file is requested for the first time, the request gets redirected to a PHP script that generates and returns the file. Additionally, it saves the file to the linked location so that next time it will be directly available. I send the PDF MIME type to make the browser open a download dialog instead of redirecting.
For reasons beyond my control, one out of 20 files cannot be generated.
How should I respond?
Error 404 or 500 would direct the browser to an error page, while sending a MIME type would let the user download an empty/defective PDF file. Is there an established best practice? How can I let the user know that a file link is broken, yet keep them on the site without a redirect?
I had the same problem and solved it as follows:
If you have a link to a file, for example:
<a download href="/files/document.pdf">Click to download</a>
And if you don't want the browser to redirect to a blank/error page when the file doesn't exist, just reply with a 204 status and no content.
Nothing will happen; the user will stay where they are, without any redirect.
In PHP it would look something like this:
if (!readfile("/files/document.pdf")) {
    // File missing or unreadable: reply 204 No Content so the browser stays on the current page
    http_response_code(204);
    die();
}
I want to use an application that checks for broken links. I have learned that Xenu is one such tool. I do not have access to the internal aspx/http files on a drive. The problem I am facing is that the website requires the user to be authenticated. After logging in, I need to crawl the site to determine which links are broken.
As an example, say I start with mail.google.com. We end up typing the username and password, after which we are served different URLs. If I give Xenu (or a similar program) a link such as mail.google.com, it will not be able to fetch URLs inside mail.google.com, which are of the type /mail/u/0/?shva=1#inbox/ etc. There lies the problem.
With minimal scripting, how can I give Xenu (or another similar app) the ability to log in via the external URL (mail.google.com in this example) so it can do whatever Xenu has to do?
Thanks
Balaji S
Xenu can be used with an authenticated user as long as the cookies are persistent. You will need to enable cookies in Xenu and log in once yourself using IE.
From their FAQ:
By default, cookies are disabled, and Xenu rejects all cookies. If you need cookies because you have used Internet Explorer to authenticate yourself before starting a run, or to prevent the server from delivering URLs with a session ID, then you can enable the cookies in the advanced options dialog. (This has been available since Version 1.2g)
Warning: You should not use this option if you have links that delete data, e.g. a database or a shop - you are risking data loss!!!
You can enable cookies from the Options menu: click Preferences and switch to the Advanced tab.
For single page applications (like Gmail) you will also need to configure Xenu to parse JavaScript.
This is done by modifying the ini file (traditionally at C:\Program Files (x86)\Xenu135\Xenu.ini) and adding a line under [Options]:
Javascript=[Jj]ava[Ss]cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]
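For clarity, that line just sits under the [Options] section of the ini file; a minimal sketch of the relevant part of Xenu.ini (the regex is the one quoted above, the rest of the file is left untouched):

[Options]
Javascript=[Jj]ava[Ss]cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]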
There are several variations provided in their FAQ, but I didn't get them to work perfectly.
I have a script that is used across multiple pages on my site. I want to set the expires header so that browsers cache it and it doesn't get downloaded every time. That's ok and I understand how to do that, but I don't quite know how the browser works.
Does the browser cache it according to its path, and is it then smart enough to know that any page requesting the script should use the cached version? Or is there an association between the script and the page, so that it would have to be cached against each page?
In the browser cache, there is no connection between the URL and the requesting page. Browser cache keys contain the path and sometimes the query string (see Is it the filename or the whole URL used as a key in browser caches?).
That's why Google recommends using their Libraries API: If every page that requires a specific version of jQuery pointed the browser to fetch the library from Google, the browser would fetch it only once for www.xyz.com and then re-use it from its cache for www.abc.com.
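For example, if several sites all reference the same copy of jQuery from Google's CDN (the version number here is only an example), each visitor's browser downloads it once and then serves it from cache for every site that uses the same URL:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>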
I have a password-protected Apache web directory I'm testing. When I first access the directory, it requires that I log in. However, on subsequent tries it lets me right in, even after I clear my browser cache. How do I get it to force a login again?
The browser stores the credentials and sends them along with every request - usually, for the duration of the current session.
Closing the browser and re-opening it usually makes it forget the credentials.
Forcing the browser to forget credentials (i.e. logging out) is tricky. See HTTP authentication logout via PHP for some approaches.
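For reference, the usual trick discussed there is to answer a request with a fresh 401, which makes the browser show the login dialog again. A minimal PHP sketch (the logout parameter and realm string are placeholders; exact behaviour varies between browsers):

if (isset($_GET['logout'])) {
    // Send 401 with WWW-Authenticate so the browser re-prompts for credentials
    header('WWW-Authenticate: Basic realm="Protected area"');
    header('HTTP/1.0 401 Unauthorized');
    echo 'Logged out.';
    exit;
}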
Easiest way I've found:
Using Firefox 4 on Mac,
Go to 'Tools' > 'Clear Recent History...' > 'Active Logins'
Refresh the page (You don't have to close the window)