Is it possible to have Selenium crawl a TLD and incrementally export a list of any 404s found?
I'm stuck on a Windows machine for a few hours and want to run some tests before heading back to the comfort of *nix...
I don't know Python very well, nor any of its commonly used libraries, but I'd probably do something like this (using C# code for the example, but the concept should apply):
// WARNING! Untested code here. May not completely work, and
// is not guaranteed to even compile.
// Requires OpenQA.Selenium, System.Net, System.Collections.Generic,
// and System.Collections.ObjectModel.
// Assume "driver" is a validly instantiated WebDriver instance
// (browser used is irrelevant). Setting driver.Url navigates to the
// page; the equivalent API is driver.get() in Python, I think.
driver.Url = "http://my.top.level.domain/";

// Get all the links on the page and loop through them,
// grabbing the href attribute of each link along the way.
// (Python would be driver.find_elements_by_tag_name)
List<string> linkUrls = new List<string>();
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
foreach (IWebElement link in links)
{
    // Nice side effect of getting the href attribute using GetAttribute()
    // is that it returns the full URL, not relative ones.
    linkUrls.Add(link.GetAttribute("href"));
}

// Now that we have all of the link hrefs, we can test to
// see if they're valid.
List<string> validUrls = new List<string>();
List<string> invalidUrls = new List<string>();
foreach (string linkUrl in linkUrls)
{
    HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest;
    request.Method = "GET";

    // Note that this is an extremely naive algorithm for determining
    // validity: anything other than 200 OK is treated as invalid. You
    // could just as easily check specifically for the NotFound (404)
    // status code. Also note that .NET throws a WebException for error
    // status codes, so the 404 case ends up in the catch block. For
    // Python, you would use whatever HTTP request library is common.
    try
    {
        using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
        {
            if (response != null && response.StatusCode == HttpStatusCode.OK)
            {
                validUrls.Add(linkUrl);
            }
            else
            {
                invalidUrls.Add(linkUrl);
            }
        }
    }
    catch (WebException)
    {
        // GetResponse() throws for 404 (and other error statuses), so
        // any link that ends up here is treated as invalid.
        invalidUrls.Add(linkUrl);
    }
}

foreach (string invalidUrl in invalidUrls)
{
    // Here is where you'd log out your invalid URLs
}
At this point, you have a list of valid and invalid URLs. You could wrap this all up into a method that you could pass your TLD URL into, and call it recursively with each of the valid URLs. The key bit here is that you're not using Selenium to actually determine the validity of the links. And you wouldn't want to "click" on the links to navigate to the next page, if you're truly doing a recursive crawl. Rather, you'd want to navigate directly to the links found on the page.
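To make that recursive structure a bit more concrete, here is a rough, untested sketch in the same C# style. The CheckLinks and IsValidUrl names, the visited set, and the depth limit are my own illustrative choices rather than part of the original answer; IsValidUrl is assumed to wrap the HttpWebRequest check shown above, and the LINQ calls need System.Linq.

// Rough, untested sketch of the recursive crawl described above.
// "IsValidUrl" is assumed to wrap the HttpWebRequest status check.
private static void CheckLinks(IWebDriver driver, string url,
    HashSet<string> visited, List<string> invalidUrls, int depth)
{
    // Stop once we've gone deep enough or have already visited this page.
    if (depth <= 0 || !visited.Add(url))
    {
        return;
    }

    // Navigate directly to the URL rather than clicking links.
    driver.Url = url;

    // Collect the hrefs up front so we don't hold stale element references
    // after navigating away from this page.
    List<string> linkUrls = driver.FindElements(By.TagName("a"))
        .Select(link => link.GetAttribute("href"))
        .Where(href => !string.IsNullOrEmpty(href))
        .ToList();

    foreach (string linkUrl in linkUrls)
    {
        if (IsValidUrl(linkUrl))
        {
            CheckLinks(driver, linkUrl, visited, invalidUrls, depth - 1);
        }
        else
        {
            invalidUrls.Add(linkUrl);
        }
    }
}

You would kick it off with something like CheckLinks(driver, "http://my.top.level.domain/", new HashSet<string>(), invalidUrls, 3); in a real crawl you would also want to skip links that leave the original domain.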
There are other approaches you might take, like running everything through a proxy, and capturing the response codes that way. It depends a little on how you expect to structure your solution.
Here is an example of what I want to achieve; however, I want to write my own custom attribute that also feeds itself from something other than the request URL. In the case of HttpGet/HttpPost, these built-in attributes obviously have to look at the HTTP request method, but is there truly no way to make Url.Action() resolve the correct URL?
[HttpGet("mygeturl")]
[HttpPost("myposturl")]
public ActionResult IndexAsync()
{
    // correct result: I get '/mygeturl' back
    var getUrl = Url.Action("Index");

    // wrong result: it adds a ?method=POST query param instead of returning '/myposturl'
    var postUrl = Url.Action("Index", new { method = "POST" });

    return View();
}
I've looked at the ASP.NET Core source code and I truly can't find a feature that would work here. All the LinkGenerator source code seems to require route data values, but route data always seems to have to live in the URL somewhere, either in the path or in the query string. And even if I add the route data value programmatically, it isn't there in time for action selection, or the LinkGenerator ignores it.
In theory, what I need is to pass something to the UrlHelper/LinkGenerator and have it understand that I want back the URL I defined in my custom attribute, in this case the HttpPost one (but I'll make my own attribute).
I need to navigate to a web site that ultimately contains a .pdf file and I want to save that file locally. I am using CEFSharp to do this. The nature of this site is such that once the .pdf appears in the browser, it cannot be accessed again. For this reason, I was wondering if once you have a .pdf displayed in the browser, is there a way to access the source for that file in the cache?
I have tried implementing IDownloadHandler and that works, but you have to click the save button on the embedded .pdf. I am trying to get around that.
OK, here is how I got it to work. There is a function in CEFSharp that allows you to filter an incoming web response; consequently, this gives you complete access to the incoming stream. My solution is a little on the dirty side and not particularly efficient, but it works for my situation. If anyone sees a better way, I am open to suggestions. There are two things I have to assume in order for my code to work.
GetResourceResponseFilter is called every time a new page is downloaded.
The PDF is the last thing to be downloaded during the navigation process.
Start with the CEF Minimal Example found here: https://github.com/cefsharp/CefSharp.MinimalExample
I used the WinForms version. Implement the IRequestHandler and IResponseFilter in the form definition as follows:
public partial class BrowserForm : Form, IRequestHandler, IResponseFilter
{
    public readonly ChromiumWebBrowser browser;

    public BrowserForm(string url)
    {
        InitializeComponent();
        browser = new ChromiumWebBrowser(url)
        {
            Dock = DockStyle.Fill,
        };
        toolStripContainer.ContentPanel.Controls.Add(browser);
        browser.BrowserSettings.FileAccessFromFileUrls = CefState.Enabled;
        browser.BrowserSettings.UniversalAccessFromFileUrls = CefState.Enabled;
        browser.BrowserSettings.WebSecurity = CefState.Disabled;
        browser.BrowserSettings.Javascript = CefState.Enabled;
        browser.LoadingStateChanged += OnLoadingStateChanged;
        browser.ConsoleMessage += OnBrowserConsoleMessage;
        browser.StatusMessage += OnBrowserStatusMessage;
        browser.TitleChanged += OnBrowserTitleChanged;
        browser.AddressChanged += OnBrowserAddressChanged;
        browser.FrameLoadEnd += browser_FrameLoadEnd;
        // Note: assigning LifeSpanHandler to "this" also requires the form
        // to implement ILifeSpanHandler (not shown here).
        browser.LifeSpanHandler = this;
        browser.RequestHandler = this;
The class declaration and the two handler assignments at the end are the most important parts for this explanation. I implemented IRequestHandler using the template found here:
https://github.com/cefsharp/CefSharp/blob/master/CefSharp.Example/RequestHandler.cs
I changed everything to what it recommends as default, except for GetResourceResponseFilter, which I implemented as follows:
IResponseFilter IRequestHandler.GetResourceResponseFilter(IWebBrowser browserControl, IBrowser browser, IFrame frame, IRequest request, IResponse response)
{
    if (request.Url.EndsWith(".pdf"))
        return this;
    return null;
}
I then implemented IResponseFilter as follows:
FilterStatus IResponseFilter.Filter(Stream dataIn, out long dataInRead, Stream dataOut, out long dataOutWritten)
{
    BinaryWriter sw;
    if (dataIn == null)
    {
        dataInRead = 0;
        dataOutWritten = 0;
        return FilterStatus.Done;
    }

    dataInRead = dataIn.Length;
    dataOutWritten = Math.Min(dataInRead, dataOut.Length);
    byte[] buffer = new byte[dataOutWritten];
    int bytesRead = dataIn.Read(buffer, 0, (int)dataOutWritten);

    // "%PDF" marks the start of a new PDF document, so discard anything
    // written previously; otherwise keep appending segments to the file.
    // (pdfFileName is a field on the form holding the output path.)
    string s = System.Text.Encoding.UTF8.GetString(buffer);
    if (s.StartsWith("%PDF"))
        File.Delete(pdfFileName);

    sw = new BinaryWriter(File.Open(pdfFileName, FileMode.Append));
    sw.Write(buffer);
    sw.Close();

    // Pass the data through to the browser unchanged.
    dataOut.Write(buffer, 0, bytesRead);
    return FilterStatus.Done;
}

bool IResponseFilter.InitFilter()
{
    return true;
}
What I found is that the PDF is actually downloaded twice when it is loaded. In any case, there might be header information and whatnot at the beginning of the page. When I get a stream segment that begins with %PDF, I know it is the beginning of a PDF, so I delete the file to discard any previous contents that might be there. Otherwise, I just keep appending each segment to the end of the file. Theoretically, the PDF file will be safe until you navigate to another PDF, but my recommendation is to do something with the file as soon as the page is loaded, just to be safe.
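As an example of that last recommendation, one option is to use the FrameLoadEnd handler that is already wired up in the constructor above and copy the captured file somewhere safe as soon as the page finishes loading. This is an untested sketch, not part of the original solution, and safePdfFileName is an illustrative field name.

// Untested sketch: copy the captured PDF to a safe location once the
// page has finished loading. "safePdfFileName" is an illustrative field.
private void browser_FrameLoadEnd(object sender, FrameLoadEndEventArgs e)
{
    // Only react to the main frame, and only if a PDF was actually captured.
    if (e.Frame.IsMain && File.Exists(pdfFileName))
    {
        File.Copy(pdfFileName, safePdfFileName, true);
    }
}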
I made a custom editor plugin, in a Seam 2.2.2 project, which handles file upload this way:
1) configure the editor to load my specific xhtml upload page;
2) call the following method inside this page, and return a JavaScript callback:
public String sendImageToServer()
{
    HttpServletRequest request = ServletContexts.instance().getRequest();
    try
    {
        List<FileItem> items = new ServletFileUpload(new DiskFileItemFactory()).parseRequest(request);
        processItems(items);    // set the file data to a specific attribute
        saveOpenAttachment();   // save the file to disk
    }
    catch (FileUploadException e)
    {
        // handle/log the upload failure
    }
    // build callback
For this to work I have to put this inside components.xml:
<web:multipart-filter create-temp-files="false"
max-request-size="1024000" url-pattern="*"/>
The create-temp-files attribute does not seem to matter, whatever its value.
But url-pattern has to be "*" or "/myUploadPage.seam"; any other value makes the item list come back empty. Does anyone know why?
This becomes a problem because, when I use a url-pattern that works for this case, every form with enctype="multipart/form-data" in my application stops submitting data, so I end up with other parts of the system breaking.
Could someone help me?
To solve my problem, I changed my solution to handle requests the way the Seam multipart filter does:
ServletRequest request = (ServletRequest) FacesContext.getCurrentInstance().getExternalContext().getRequest();
try
{
    if (!(request instanceof MultipartRequest))
    {
        // unwrapMultipartRequest() is a helper (not shown) that digs the
        // MultipartRequest out of the wrapped request.
        request = unwrapMultipartRequest(request);
    }
    if (request instanceof MultipartRequest)
    {
        MultipartRequest multipartRequest = (MultipartRequest) request;
        String clientId = "upload";
        setFileData(multipartRequest.getFileBytes(clientId));
        setFileContentType(multipartRequest.getFileContentType(clientId));
        setFileName(multipartRequest.getFileName(clientId));
        saveOpenAttachment();
    }
}
catch (Exception e)
{
    // handle/log the failure
}
Now I handle the request the way Seam does, and I no longer need the web:multipart-filter config that was breaking other types of requests.
RavenDB throws InvalidOperationException when IsOperationAllowedOnDocument is called using embedded mode.
I can see in the IsOperationAllowedOnDocument implementation a clause checking for calls in embedded mode.
namespace Raven.Client.Authorization
{
    public static class AuthorizationClientExtensions
    {
        public static OperationAllowedResult[] IsOperationAllowedOnDocument(this ISyncAdvancedSessionOperation session, string userId, string operation, params string[] documentIds)
        {
            var serverClient = session.DatabaseCommands as ServerClient;
            if (serverClient == null)
                throw new InvalidOperationException("Cannot get whatever operation is allowed on document in embedded mode.");
Is there a workaround for this other than not using embedded mode?
Thanks for your time.
I encountered the same situation while writing some unit tests. The solution James provided worked; however, it resulted in having one code path for the unit tests and another for the production code, which defeated the purpose of the unit tests. We were able to create a second document store and connect it to the first one, which allowed us to access the authorization extension methods successfully. While this solution would probably not be good for production code (because creating document stores is expensive), it works nicely for unit tests. Here is a code sample:
using (var documentStore = new EmbeddableDocumentStore
{
    RunInMemory = true,
    UseEmbeddedHttpServer = true,
    Configuration = { Port = EmbeddedModePort }
})
{
    documentStore.Initialize();
    var url = documentStore.Configuration.ServerUrl;
    using (var docStoreHttp = new DocumentStore { Url = url })
    {
        docStoreHttp.Initialize();
        using (var session = docStoreHttp.OpenSession())
        {
            // now you can run code like:
            // session.GetAuthorizationFor(),
            // session.SetAuthorizationFor(),
            // session.Advanced.IsOperationAllowedOnDocument(),
            // etc...
        }
    }
}
There are a couple of other items that should be mentioned:
The first document store needs to be run with UseEmbeddedHttpServer set to true so that the second one can access it.
I created a constant for the port so it would be used consistently and to ensure use of a non-reserved port.
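For reference, the port constant referenced in the sample above could be as simple as the following; the specific value is my own illustrative choice, not taken from the original answer.

// Illustrative only: any free, non-reserved port will do.
private const int EmbeddedModePort = 8079;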
I encountered this as well. Looking at the source, there's no way to do that operation as written. I'm not sure if there's some intrinsic reason why, since I could easily replicate the functionality in my app by making an HTTP request directly for the same info:
// Assumes "userid", "operation", "entityid", and "_session" are already
// in scope, and that the server is listening on port 8080.
HttpClient http = new HttpClient();
http.BaseAddress = new Uri("http://localhost:8080");
var url = new StringBuilder("/authorization/IsAllowed/")
    .Append(Uri.EscapeUriString(userid))
    .Append("?operation=")
    .Append(Uri.EscapeUriString(operation))
    .Append("&id=")
    .Append(Uri.EscapeUriString(entityid));
http.GetStringAsync(url.ToString()).ContinueWith((response) =>
{
    var results = _session.Advanced.DocumentStore.Conventions.CreateSerializer()
        .Deserialize<OperationAllowedResult[]>(
            new RavenJTokenReader(RavenJToken.Parse(response.Result)));
}).Wait();
I am new to web crawling, and I am using HttpWebRequest to crawl data from sites.
As of now, I was able to successfully crawl and get data from my WordPress site. This data was simple user profile data (like name, email, AIM id, etc.).
Now, as an exercise, I want to crawl Wikipedia, where I will search using the value entered into a textbox at my end, and then crawl Wikipedia with that search value and get the appropriate title(s) from the search.
Now I have the following doubts/difficulties.
Firstly, is this even possible? I have heard that Wikipedia has a robots.txt set up to block this. Though I have heard this only from a friend, so I am not sure.
I am using the same procedure I used earlier, but I am not getting the required results.
Thanks!
Update:
After some explanation and help from @svick, I tried the code below, but I am still not able to get any value (see the last line of code; there I am expecting the HTML markup of the search result page):
string searchUrl = "http://en.wikipedia.org/w/index.php?search=Wikipedia&title=Special%3ASearch";
var postData = new StringBuilder();
postData.Append("search=" + model.Query);
postData.Append("&");
postData.Append("title" + "Special:Search");
byte[] data2 = Crawler.GetEncodedData(postData.ToString());
var webRequest = (HttpWebRequest)WebRequest.Create(searchUrl);
webRequest.Method = "POST";
webRequest.UserAgent = "Crawling HW (http://yassershaikh.com/contact-me/)";
webRequest.AllowAutoRedirect = false;
ServicePointManager.Expect100Continue = false;
Stream requestStream = webRequest.GetRequestStream();
requestStream.Write(data2, 0, data2.Length);
requestStream.Close();
var responseCsv = (HttpWebResponse)webRequest.GetResponse();
Stream response = responseCsv.GetResponseStream();
// Todo Parsing
var streamReader = new StreamReader(response);
string val = streamReader.ReadToEnd();
// val is empty !! <-- this is my problem !
and here is my GetEncodedData method definition:
public static byte[] GetEncodedData(string postData)
{
var encoding = new ASCIIEncoding();
byte[] data = encoding.GetBytes(postData);
return data;
}
Please help me with this.
You probably don't need to use HttpWebRequest. Using WebClient (or HttpClient if you're on .NET 4.5) will be much easier for you.
robots.txt doesn't actually block anything; it is only advisory. If a client doesn't honor it (and .NET doesn't honor it by itself), it can access anything.
Wikipedia does block requests that don't have their User-Agent header set. And you should use an informative User-Agent string with your contact information.
A better way to access Wikipedia is to use its API, rather than scraping. This way, you will get an answer that's specifically meant to be read by custom applications, formatted as XML or JSON. There are also dumps containing all information from Wikipedia available for download.
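For illustration, here is a minimal, untested sketch of the API route using HttpClient; the action=query&list=search parameters are the standard MediaWiki search API, model.Query is reused from the question's code, and the JSON parsing is left to whatever library you prefer.

// Untested sketch: query the MediaWiki search API instead of scraping
// the HTML search page. Requires System.Net.Http (.NET 4.5+).
using (var http = new HttpClient())
{
    // Wikipedia wants an informative User-Agent with contact information.
    http.DefaultRequestHeaders.UserAgent.TryParseAdd(
        "Crawling HW (http://yassershaikh.com/contact-me/)");

    string apiUrl = "http://en.wikipedia.org/w/api.php?action=query&list=search"
        + "&format=json&srsearch=" + Uri.EscapeDataString(model.Query);

    // The response is JSON containing the matching page titles; parse it
    // with a JSON library such as Json.NET.
    string json = http.GetStringAsync(apiUrl).Result; // blocking for brevity
}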
EDIT: The problem with your newly posted code is that your query returns a 302 Moved Temporarily response pointing at the matching article, if one exists. Either remove the line that disables AllowAutoRedirect, or add &fulltext=Search to your query, which means you won't get redirected.
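If you do stick with scraping, here is a minimal, untested sketch of that fix using WebClient; it reuses the search URL and User-Agent from the question and simply adds the fulltext=Search parameter.

// Untested sketch: with fulltext=Search, the results page is returned
// directly instead of a 302 redirect to a single matching article.
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] =
        "Crawling HW (http://yassershaikh.com/contact-me/)";

    string url = "http://en.wikipedia.org/w/index.php?title=Special:Search"
        + "&fulltext=Search&search=" + Uri.EscapeDataString(model.Query);

    string html = client.DownloadString(url); // HTML of the search results page
}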