Crawl Wikipedia using ASP.NET HttpWebRequest

I am new to Web Crawling, and I am using HttpWebRequest to crawl data from sites.
So far I have successfully been able to crawl and get data from my WordPress site. That data was simple user profile data (like name, email, AIM id, etc.).
Now, as an exercise, I want to crawl Wikipedia: I will take the value entered into a textbox on my end, search Wikipedia with that value, and get the appropriate title(s) from the search results.
Now I have the following doubts/difficulties.
Firstly, is this even possible? I have heard that Wikipedia has a robots.txt set up to block this, though I have only heard this from a friend, so I am not sure.
I am using the same procedure I used earlier, but I am not getting the required results.
Thanks!
Update :
After some explanation and help from @svick, I tried the code below, but I am still not able to get any value (see the last line of the code, where I expect the HTML markup of the search results page).
string searchUrl = "http://en.wikipedia.org/w/index.php?search=Wikipedia&title=Special%3ASearch";
var postData = new StringBuilder();
postData.Append("search=" + model.Query);
postData.Append("&");
postData.Append("title" + "Special:Search");
byte[] data2 = Crawler.GetEncodedData(postData.ToString());
var webRequest = (HttpWebRequest)WebRequest.Create(searchUrl);
webRequest.Method = "POST";
webRequest.UserAgent = "Crawling HW (http://yassershaikh.com/contact-me/)";
webRequest.AllowAutoRedirect = false;
ServicePointManager.Expect100Continue = false;
Stream requestStream = webRequest.GetRequestStream();
requestStream.Write(data2, 0, data2.Length);
requestStream.Close();
var responseCsv = (HttpWebResponse)webRequest.GetResponse();
Stream response = responseCsv.GetResponseStream();
// Todo Parsing
var streamReader = new StreamReader(response);
string val = streamReader.ReadToEnd();
// val is empty !! <-- this is my problem !
And here is my GetEncodedData method definition:
public static byte[] GetEncodedData(string postData)
{
var encoding = new ASCIIEncoding();
byte[] data = encoding.GetBytes(postData);
return data;
}
Please help me with this.

You probably don't need to use HttpWebRequest. Using WebClient (or HttpClient if you're on .Net 4.5) will be much easier for you.
robots.txt doesn't actually block anything; it is purely advisory. If a client doesn't honor it (and .NET doesn't honor it on its own), it can still access anything.
Wikipedia does block requests that don't have a User-Agent header set, and you should use an informative User-Agent string with your contact information.
A better way to access Wikipedia is to use its API rather than scraping the HTML. This way, you will get an answer that's specifically meant to be read by custom applications, formatted as XML or JSON. There are also dumps containing all information from Wikipedia available for download.
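For illustration, here is a minimal, untested sketch of that API approach using WebClient; the action=query&list=search parameters are the standard MediaWiki full-text search call, and the User-Agent value is just an example taken from your code:
using System;
using System.Net;

class WikipediaApiSearch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Wikipedia expects an informative User-Agent with contact info.
            client.Headers[HttpRequestHeader.UserAgent] =
                "Crawling HW (http://yassershaikh.com/contact-me/)";

            string query = Uri.EscapeDataString("Wikipedia");
            string url = "https://en.wikipedia.org/w/api.php"
                       + "?action=query&list=search&format=json"
                       + "&srsearch=" + query;

            // The JSON response contains the matching article titles.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}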
EDIT: The problem with your newly posted code is that your query returns a 302 Moved Temporarily response redirecting to the matched article, if one exists. Either remove the line that disables AllowAutoRedirect, or add &fulltext=Search to your query, which means you won't get redirected.
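If you want to stay with the HTML-scraping route, a rough sketch of the second fix (adding &fulltext=Search so you always land on the results page) could look like this; model.Query stands in for your textbox value, and the System.Net and System.IO namespaces from your existing code are assumed:
// Sketch of the "&fulltext=Search" fix: a plain GET, no POST body needed.
string searchUrl = "http://en.wikipedia.org/w/index.php"
                 + "?search=" + Uri.EscapeDataString(model.Query)
                 + "&title=Special%3ASearch&fulltext=Search";

var webRequest = (HttpWebRequest)WebRequest.Create(searchUrl);
webRequest.Method = "GET";
webRequest.UserAgent = "Crawling HW (http://yassershaikh.com/contact-me/)";

using (var response = (HttpWebResponse)webRequest.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd(); // HTML of the search results page
}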

Related

Wicket 6 - Capturing HttpServletRequest parameters in Multipart form?

Using Wicket 6.17 and Servlet 2.5, I have a form that allows file upload and also has ReCaptcha (using Recaptcha4j). When the form has ReCaptcha without file upload, it works properly using the code:
final HttpServletRequest servletRequest = (HttpServletRequest) ((WebRequest) getRequest()).getContainerRequest();
final String remoteAddress = servletRequest.getRemoteAddr();
final String challengeField = servletRequest.getParameter("recaptcha_challenge_field");
final String responseField = servletRequest.getParameter("recaptcha_response_field");
to get the challenge and response fields so that they can be validated.
This doesn't work when the form has the file upload because the form must be multipart for the upload to work, and so when I try to get the parameters in that fashion, it fails.
I have tried getting the parameters differently using ServletFileUpload:
ServletFileUpload fileUpload = new ServletFileUpload(new DiskFileItemFactory(new FileCleaner()) );
String response = IOUtils.toString(servletRequest.getInputStream());
and
ServletFileUpload fileUpload = new ServletFileUpload(new DiskFileItemFactory(new FileCleaner()) );
List<FileItem> requests = fileUpload.parseRequest(servletRequest);
both of which always return empty.
Using Chrome's network console, I see the values that I'm looking for in the Request Payload, so I know that they are there somewhere.
Any advice on why the requests are coming back empty and how to find them would be greatly appreciated.
Update: I have also tried making the ReCaptcha component multipart and left out the file upload. The result is still the same that the response is empty, leaving me with the original conclusion about multipart form submission being the problem.
Thanks to the Wicket In Action book, I have found the solution:
MultipartServletWebRequest multiPartRequest = webRequest.newMultipartWebRequest(getMaxSize(), "ignored");
// multiPartRequest.parseFileParts(); // this is needed since Wicket 6.19.0+
IRequestParameters params = multiPartRequest.getRequestParameters();
allows me to read the values now using the getParameterValue() method.

Selenium build list of 404s

Is it possible to have Selenium crawl a TLD and incrementally export a list of any 404s found?
I'm stuck on a Windows machine for a few hours and want to run some tests before going back to the comfort of *nix...
I don't know Python very well, nor any of its commonly used libraries, but I'd probably do something like this (using C# code for the example, but the concept should apply):
// WARNING! Untested code here. May not completely work, and
// is not guaranteed to even compile.
// Assume "driver" is a validly instantiated WebDriver instance
// (browser used is irrelevant). This API is driver.get in Python,
// I think.
driver.Url = "http://my.top.level.domain/";
// Get all the links on the page and loop through them,
// grabbing the href attribute of each link along the way.
// (Python would be driver.find_elements_by_tag_name)
List<string> linkUrls = new List<string>();
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
foreach(IWebElement link in links)
{
// Nice side effect of getting the href attribute using GetAttribute()
// is that it returns the full URL, not relative ones.
linkUrls.Add(link.GetAttribute("href"));
}
// Now that we have all of the link hrefs, we can test to
// see if they're valid.
List<string> validUrls = new List<string>();
List<string> invalidUrls = new List<string>();
foreach(string linkUrl in linkUrls)
{
HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest;
request.Method = "GET";
// For actual .NET code, you'd probably want to wrap this in a
// try-catch, and use a null check, in case GetResponse() throws,
// or returns a type other than HttpWebResponse. For Python, you
// would use whatever HTTP request library is common.
// Note also that this is an extremely naive algorithm for determining
// validity. You could just as easily check for the NotFound (404)
// status code.
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
if (response.StatusCode == HttpStatusCode.OK)
{
validUrls.Add(linkUrl);
}
else
{
invalidUrls.Add(linkUrl);
}
}
foreach(string invalidUrl in invalidUrls)
{
// Here is where you'd log out your invalid URLs
}
At this point, you have a list of valid and invalid URLs. You could wrap this all up into a method that you could pass your TLD URL into, and call it recursively with each of the valid URLs. The key bit here is that you're not using Selenium to actually determine the validity of the links. And you wouldn't want to "click" on the links to navigate to the next page, if you're truly doing a recursive crawl. Rather, you'd want to navigate directly to the links found on the page.
There are other approaches you might take, like running everything through a proxy, and capturing the response codes that way. It depends a little on how you expect to structure your solution.
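If it helps, here is a rough, untested sketch of that recursive wrapper. The names CrawlForBrokenLinks, visited, and IsBrokenLink are purely illustrative; IsBrokenLink stands in for the HttpWebRequest status-code check shown above, and the OpenQA.Selenium and System.Collections.Generic namespaces are assumed:
// Untested sketch of the recursive crawl described above.
// "visited" prevents re-crawling pages; "IsBrokenLink" is a hypothetical
// helper wrapping the HttpWebRequest/status-code check shown earlier.
void CrawlForBrokenLinks(IWebDriver driver, string url,
                         HashSet<string> visited, List<string> invalidUrls)
{
    if (!visited.Add(url))
        return; // already crawled this page

    driver.Url = url;
    List<string> linkUrls = new List<string>();
    foreach (IWebElement link in driver.FindElements(By.TagName("a")))
    {
        string href = link.GetAttribute("href");
        if (!string.IsNullOrEmpty(href))
            linkUrls.Add(href);
    }

    foreach (string linkUrl in linkUrls)
    {
        if (IsBrokenLink(linkUrl))
            invalidUrls.Add(linkUrl);   // collect/log the 404s
        else
            CrawlForBrokenLinks(driver, linkUrl, visited, invalidUrls);
    }
}
In practice you would also want to restrict the recursion to links on the same domain so the crawl doesn't wander off-site.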

VB.NET Upload Photos to Facebook using Graph API

Looking for some code to upload a photo to Facebook using the Graph API in VB.NET. I have the Facebook C# SDK, but it doesn't support uploading photos as far as I can tell.
Accessing the photos works fine and I can send other content to Facebook fine as well. Just not photos.
The Facebook documentation talks about attaching the file as a multipart form request, but I have no idea how to do that. To say that it's not very well documented is putting it lightly. Even the guys I hire to do this kind of thing couldn't get it to work.
I've found this: Upload Photo To Album with Facebook's Graph API, but it only describes how to do it in PHP.
I've also seen varying methods from different sites about passing the URL of the photo as part of the HTTP request, but after trying local and remote URLs several times I kept getting a bad URL error or something like that.
Any thoughts?
You need to pass the image in a POST request to the Graph API (this needs the publish_stream permission). What is mentioned in the Facebook documentation is correct. Following is example code that may do the work; use it inside a method. (The code is in C#.)
Legend
<content> : you need to provide the info.
Update
Please post comments to improve the code.
string queryString = string.Concat("access_token=", /*<Place your access token here>*/);
string boundary = DateTime.Now.Ticks.ToString("x", CultureInfo.InvariantCulture);
var sb = new StringBuilder();
sb.Append("----------").Append(boundary).Append("\r\n");
sb.Append("Content-Disposition: form-data; filename=\"").Append(/*<Enter you image's flename>*/).Append("\"").Append("\r\n");
sb.Append("Content-Type: ").Append(String.Format("Image/{0}"/*<Enter your file type like jpg, bmp, gif, etc>*/)).Append("\r\n").Append("\r\n");
// Read the image as raw bytes; reading it through a text reader would corrupt binary data.
byte[] fileData = File.ReadAllBytes("/*<Enter the full physical path of the Image file>*/");
byte[] postHeaderBytes = Encoding.UTF8.GetBytes(sb.ToString());
byte[] boundaryBytes = Encoding.UTF8.GetBytes(String.Concat("\r\n", "----------", boundary, "----------", "\r\n"));
var postData = new byte[postHeaderBytes.Length + fileData.Length + boundaryBytes.Length];
Buffer.BlockCopy(postHeaderBytes, 0, postData, 0, postHeaderBytes.Length);
Buffer.BlockCopy(fileData, 0, postData, postHeaderBytes.Length, fileData.Length);
Buffer.BlockCopy(boundaryBytes, 0, postData, postHeaderBytes.Length + fileData.Length, boundaryBytes.Length);
var requestUri = new UriBuilder("https://graph.facebook.com/me/photos");
requestUri.Query = queryString;
var request = (HttpWebRequest)HttpWebRequest.Create(requestUri.Uri);
request.Method = "POST";
request.ContentType = String.Concat("multipart/form-data; boundary=", boundary);
request.ContentLength = postData.Length;
using (var dataStream = request.GetRequestStream())
{
dataStream.Write(postData, 0, postData.Length);
}
request.GetResponse();
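The last line fires the request but discards whatever comes back. If you want to confirm the upload, a small (hedged) addition would be to read the reply, which the Graph API returns as JSON containing the new photo's id:
// Optional: read the Graph API's JSON reply instead of discarding it.
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string json = reader.ReadToEnd(); // e.g. {"id":"..."} on success
}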

Why am I getting System.FormatException: String was not recognized as a valid Boolean on a fraction of our customers' machines?

Our C#/.NET software connects to an online app to deal with accounts and a shop. It does this using HttpWebRequest and HttpWebResponse.
An example of this interaction, and one area where the exception in the title has come from is:
var request = HttpWebRequest.Create(onlineApp + string.Format("isvalid.ashx?username={0}&password={1}", HttpUtility.UrlEncode(username), HttpUtility.UrlEncode(password))) as HttpWebRequest;
request.Method = "GET";
using (var response = request.GetResponse() as HttpWebResponse)
using (var ms = new MemoryStream())
{
var responseStream = response.GetResponseStream();
byte[] buffer = new byte[4096];
int read;
do
{
read = responseStream.Read(buffer, 0, buffer.Length);
ms.Write(buffer, 0, read);
} while (read > 0);
ms.Position = 0;
return Convert.ToBoolean(Encoding.ASCII.GetString(ms.ToArray()));
}
The online app will respond with either 'true' or 'false'. In all our testing it gets one of these values, but for a couple of customers (out of hundreds) we get the exception System.FormatException: String was not recognized as a valid Boolean, which sounds like the response is being garbled by something. If we ask them to go to the online app in their web browser, they see the correct response. The clients are usually on school networks, which can be fairly restrictive and are often behind proxy servers, but most cope fine once they've put the proxy details in or added a firewall exception. Is there something that could be messing up the response from the server, or is something wrong with our code?
Indeed, it's possible that the return result is somehow different.
Is there any particular reason you are using the reasonably elaborate method of reading the response there? Why not:
string data;
using(HttpWebResponse response = request.GetResponse() as HttpWebResponse){
StreamReader str = new StreamReader(response.GetResponseStream());
data = str.ReadToEnd();
str.Close();
}
string cleanResult = data.Trim().ToLower();
// log this
return Convert.ToBoolean(cleanResult);
First thing to note is I would definitely use something like:
bool myBool = false;
Boolean.TryParse(Encoding.ASCII.GetString(ms.ToArray()), out myBool);
return myBool;
It's not some localisation issue, is it? Perhaps it's expecting the Swahili version of 'true' and getting confused. Are all the sites in one country, with the same language, etc.?
I'd add logging, as suggested by others, and see what results you're seeing.
I'd also lean towards changing the code as silky suggested, though with a few further changes from me (code 'smell' issues, IMO): use using around the StreamReader as well as the response.
Also, I don't think the use of as is appropriate in this instance. If the response can't be cast to HttpWebResponse (which, admittedly, is unlikely, but still), you'll get a NullReferenceException on the response.GetResponseStream() call, which is both a vague error and loses the original line number. Using (HttpWebResponse)request.GetResponse() will give you a more accurate error and the correct line number of the actual failure.
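Putting those suggestions together, a rough sketch (reusing the request variable from the question's code) might look like this:
// Sketch combining the suggestions: using blocks, a direct cast instead
// of "as", and TryParse with logging of unexpected values.
bool isValid = false;
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream(), Encoding.ASCII))
{
    string raw = reader.ReadToEnd();
    if (!Boolean.TryParse(raw.Trim(), out isValid))
    {
        // Log the raw value here so the garbled responses can be inspected later.
    }
}
return isValid;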

Why am I getting a "double response" from HttpWebResponse?

The following code (running in ASP.NET 2.0) displays the contents of the requested URL twice. I only want it to display the contents of the requested URL once. I can't figure out what I'm doing wrong. The requested URL returns XML, and if I visit the URL directly, it works fine.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
byte[] postDataBytes = Encoding.UTF8.GetBytes(postData);
request.Method = "POST";
request.ContentType = "application/xml";
request.ContentLength = postDataBytes.Length;
Stream requestStream = request.GetRequestStream();
requestStream.Write(postDataBytes, 0, postDataBytes.Length);
requestStream.Close();
// get response and write to console
response = (HttpWebResponse) request.GetResponse();
StreamReader responseReader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
try {
Response.Write(responseReader.ReadToEnd());
}
finally {
responseReader.Close();
}
response.Close();
Your code looks good, so I don't think the problem is there... but what I would suggest is the following:
1) Maybe the error is on the URL's other end... so try hitting Google and see if the returned content is good or not.
2) Put a breakpoint at the "responseReader.ReadToEnd()" spot, and see if what's coming out of there is good.
3) If this code is in an ASPX page... are you making sure to call "Response.End();" after your last line of code? (Not "response.Close()", but "Response.End()"; see the sketch below.)
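For point 3, that would look something like the following at the end of the proxying code (note the capital-R Response, i.e. the ASP.NET page's own response, not the HttpWebResponse):
// Write the proxied content, then end the page response so nothing from
// the rest of the page lifecycle can be appended after it.
Response.Write(responseReader.ReadToEnd());
Response.End();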
I found the problem. It's not with the above code at all, but with the page being called. The page I was calling inherited from a class whose Page_OnInit method contained the following line: "MyBase.OnLoad(e)", which caused the Page_OnLoad method to be executed twice. Obviously, it should have been MyBase.OnInit(e) instead. I didn't catch it because when I tested the page directly I had to temporarily remove the inheritance from the class, because of some other code that would have prevented me from testing the page directly.
I will now put on my "Dunce" hat and retreat to the corner for a time out. Thanks anyway for the help.