I would like to get the source code of this page, for example:
My page URL
I have tried WebClient (DownloadString and DownloadFile) as well as HttpWebRequest, but I always get an empty string back instead of the source code.
With Firefox, Edge, or any other browser, I get the source code without a problem.
How can I get the source code of the given example?
This is one of the many pieces of code that I tried:
Using client = New WebClient()
    client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; rv:40.0) Gecko/20100101 Firefox/40.0")
    Dim MyURL As String = "https://www.virustotal.com/fr/file/c65ce5ab02b69358d07b56434527d3292ea2cb12357047e6a396a5b27d9ef680/analysis/"
    Dim Source_Code As String = client.DownloadString(MyURL)
    MsgBox(Source_Code)
    TextBox1.Text = Source_Code
End Using
NB 1: I don't want to use the WebBrowser control or anything similar.
NB 2: WebClient works fine with all other sites.
It seems the target server is picky and requires the Accept-Language header to return any content. The following code returns the page's content:
var url = "https://www.virustotal.com/fr/file/c65ce5ab02b69358d07b56434527d3292ea2cb12357047e6a396a5b27d9ef680/analysis/";
var client = new System.Net.WebClient();
client.Headers.Add("Accept-Language", "en");
var content = client.DownloadString(url);
If the Accept-Language header is missing, no data is returned.
To find this, you can use a tool like Fiddler to capture the HTTP requests and responses from your browser and from your application. By removing the headers sent by the browser one at a time, you can find out which header the server actually requires.
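Since the question is in VB.NET, the same fix translates directly (a sketch of the C# snippet above, using the URL from the question):

Using client As New System.Net.WebClient()
    ' The server returns an empty body unless Accept-Language is present.
    client.Headers.Add("Accept-Language", "en")
    Dim MyURL As String = "https://www.virustotal.com/fr/file/c65ce5ab02b69358d07b56434527d3292ea2cb12357047e6a396a5b27d9ef680/analysis/"
    Dim Source_Code As String = client.DownloadString(MyURL)
End Using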
Related
I was checking the active links on a website with Selenium WebDriver and Java. I passed the links into an array, and while verifying them I get a 403 Forbidden response for every link on the site. It is just a public website that anyone can access, and the links work properly when clicked manually. I want to know why it is not returning 200, and what can be done in this situation.
This is for Selenium WebDriver with Java:
for (int j = 0; j < activelinks.size(); j++) {
    System.out.println("Active Link address and status >>> " + activelinks.get(j).getAttribute("href"));
    HttpURLConnection connection = (HttpURLConnection) new URL(activelinks.get(j).getAttribute("href")).openConnection();
    connection.connect();
    String response = connection.getResponseMessage();
    int responsecode = connection.getResponseCode();
    connection.disconnect();
    System.out.println(activelinks.get(j).getAttribute("href") + ">>" + response + " " + responsecode);
}
I expect the response code to be 200, but the actual output is 403.
I believe you need to add the relevant cookies to the HttpURLConnection, or better yet consider switching to the OkHttp library, which is under the hood of the Selenium Java client.
So you basically need to fetch the cookies from the browser using the driver.manage().getCookies() function and generate a proper Cookie request header for the subsequent calls.
Example code:
// Build a Cookie request header value from the browser's current session cookies
StringBuilder cookieBuilder = new StringBuilder();
driver.manage().getCookies()
        .forEach(cookie -> cookieBuilder
                .append(cookie.getName())
                .append("=")
                .append(cookie.getValue())
                .append(";"));

OkHttpClient client = new OkHttpClient().newBuilder().build();
for (WebElement activelink : activelinks) {
    Request request = new Request.Builder()
            .url(activelink.getAttribute("href"))
            .addHeader("Cookie", cookieBuilder.toString())
            .build();
    Response urlResponse = client.newCall(request).execute();
    String response = urlResponse.message();
    int responsecode = urlResponse.code();
    System.out.println(activelink.getAttribute("href") + ">>" + response + " " + responsecode);
}
If you need nothing but the response code, you can consider using the HEAD method to avoid fetching the full body of each URL - this will save traffic and make your test much faster.
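The same idea carries over to the .NET clients used in the other questions on this page. As a rough sketch (my own illustration with a placeholder URL, assuming the server allows HEAD; note that HttpWebRequest.GetResponse raises a WebException for 4xx/5xx statuses):

' Sketch: probe a link with HEAD instead of GET (requires Imports System.Net)
Dim probe As HttpWebRequest = CType(WebRequest.Create("https://example.com/"), HttpWebRequest)
probe.Method = "HEAD"
Try
    Using resp As HttpWebResponse = CType(probe.GetResponse(), HttpWebResponse)
        Console.WriteLine(CInt(resp.StatusCode)) ' 200 for a working link
    End Using
Catch ex As WebException When TypeOf ex.Response Is HttpWebResponse
    ' 4xx/5xx statuses surface as a WebException; the code is still on the response
    Console.WriteLine(CInt(CType(ex.Response, HttpWebResponse).StatusCode))
End Try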
403 Forbidden
The HTTP 403 Forbidden client error status response code indicates that the server understood the request but refuses to authorize it.
This status is similar to 401, but in this case, re-authenticating will make no difference. The access is permanently forbidden and tied to the application logic, such as insufficient rights to a resource.
Reason
I don't see any such issue in your code block. However, there is a possibility that the WebDriver-controlled browser client is being detected, and hence the subsequent requests are being blocked. Detection can hinge on numerous factors, including:
User agent
Plugins
Languages
WebGL
Browser features
Missing image
You can find a couple of detailed discussions in:
How does recaptcha 3 know I'm using selenium/chromedriver?
Selenium and non-headless browser keeps asking for Captcha
Solution
A generic solution is to use a proxy or rotating proxies from the Free Proxy List.
You can find a detailed discussion in Change proxy in chromedriver for scraping purposes
Outro
You can find a couple of relevant discussions in:
Can a website detect when you are using selenium with chromedriver?
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Failed to load resource: the server responded with a status of 429 (Too Many Requests) and 404 (Not Found) with ChromeDriver Chrome through Selenium
Had the same problem; the user agent was the issue in my case (read more here: https://www.javacodegeeks.com/2018/05/how-to-handle-http-403-forbidden-error-in-java.html).
Also check which request methods are allowed on your website. You can do that by looking at any endpoint in the "Network" tab in Chrome; it should list the allowed request methods. In my case I couldn't use "HEAD", but "GET" did the trick.
Code:
List<WebElement> links = driver.findElements(By.tagName("a"));
boolean brokenLink = false;
for (WebElement link : links) {
    String url = link.getAttribute("href");
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36");
    conn.connect();
    int httpCode = conn.getResponseCode();
    if (httpCode >= 400) {
        System.out.println("BROKEN LINK: " + url + " " + httpCode);
        brokenLink = true;
        Assert.assertFalse(brokenLink); // fails the test at the first broken link
    } else {
        System.out.println("Working link: " + url + " " + httpCode);
    }
}
I am trying to extract some business data from Facebook pages using VB.NET. However, I am not getting the response I would expect.
Dim request As HttpWebRequest
Dim response As HttpWebResponse
Dim responseText As String

request = CType(WebRequest.Create("http://www.facebook.com/Microsoft"), HttpWebRequest)
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"
request.AllowAutoRedirect = True
response = CType(request.GetResponse(), HttpWebResponse)
If I look at the text for the response I get this:
<html><head><title>Redirecting...</title><script>__DEV__=0;_script_path = "XVanityURLController";var uri_re=/^(?:(?:[^:\/?#]+):)?(?:\/\/(?:[^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/,target_domain='';window.location.href.replace(uri_re,function(a,b,c,d){var e,f,g;e=f=b+(c?'?'+c:'');if(d){d=d.replace(/^(!|%21)/,'');g=d.charAt(0);if(g=='/'||g=='\\')e=d.replace(/^[\\\/]+/,'/');}if(e!=f)window.location.replace(target_domain+e);});</script><script type="text/javascript">/*<![CDATA[*/(function(){function si_cj(m){setTimeout(function(){new Image().src="https:\/\/error.facebook.com\/common\/scribe_endpoint.php?c=si_clickjacking&t=956"+"&m="+m;},5000);}if(top!=self && !false){try{if(parent!=top){throw 1;}var si_cj_d=["apps.facebook.com","apps.beta.facebook.com"];var href=top.location.href.toLowerCase();for(var i=0;i<si_cj_d.length;i++){if (href.indexOf(si_cj_d[i])>=0){throw 1;}}si_cj("3 ");}catch(e){si_cj("1 \t");window.document.write("\u003Cstyle>body * {display:none !important;}\u003C\/style>\u003Ca href=\"#\" onclick=\"top.location.href=window.location.href\" style=\"display:block !important;padding:10px\">Go to Facebook.com\u003C\/a>");/*kSxhSBR_*/}}}())/*]]>*/</script><script>window.location.replace("https:\/\/m.facebook.com\/AMD");</script><meta http-equiv="refresh" content="0;url=https://m.facebook.com/AMD" /></head><body></body></html>
However, when I use a WebBrowser control it actually redirects me to the Microsoft page. I don't want to use a form to accomplish this, though.
So I'm not sure how to bypass this redirect with HttpWebRequest. Do I need to somehow log in to Facebook in order to get the response I'm looking for? If so, how do I do this? Please help; I've been banging my head on this for days.
The page is using JavaScript to perform the redirect.
Your HttpWebResponse gets the HTML returned as a string, but it does not execute the JavaScript inside of it.
Try looking into using a headless web browser, such as Selenium.
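If a full headless browser is overkill, one lightweight alternative (my own sketch, not part of the original answer) is to pull the redirect target out of the meta refresh tag in the returned HTML and issue a second request:

' Sketch: follow the JavaScript/meta-refresh redirect manually.
' Assumes the page redirects via <meta http-equiv="refresh" content="0;url=..."> as shown above;
' requires Imports System.Net and System.Text.RegularExpressions.
Dim m As Match = Regex.Match(responseText, "http-equiv=""refresh"" content=""\d+;url=([^""]+)""")
If m.Success Then
    Dim target As String = m.Groups(1).Value ' e.g. https://m.facebook.com/AMD
    Using client As New WebClient()
        Dim redirected As String = client.DownloadString(target)
    End Using
End If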
Regardless of whether I use WebClient or HttpWebRequest, loading this page times out. What am I doing wrong? It can't be HTTPS, since other HTTPS sites load just fine.
Below is my latest attempt, which adds all the headers that I see in Firefox's inspector.
One interesting behavior is that I cannot monitor this with Fiddler, because everything works properly while Fiddler is running.
Using client As WebClient = New WebClient()
    client.Headers(HttpRequestHeader.Accept) = "text/html, image/png, image/jpeg, image/gif, */*;q=0.1"
    client.Headers(HttpRequestHeader.UserAgent) = "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12"
    client.Headers(HttpRequestHeader.AcceptLanguage) = "en-US;en;q=0.5"
    client.Headers(HttpRequestHeader.AcceptEncoding) = "gzip, deflate, br"
    client.Headers(HttpRequestHeader.Referer) = "http://www.torontohydro.com/sites/electricsystem/Pages/foryourhome.aspx"
    client.Headers("DNT") = "1"
    client.Headers(HttpRequestHeader.KeepAlive) = "keep-alive"
    client.Headers(HttpRequestHeader.Upgrade) = "1"
    client.Headers(HttpRequestHeader.CacheControl) = "max-age=0"

    Dim x = New Uri("https://css.torontohydro.com/")
    Dim data As String = client.DownloadString(x)
End Using
All of this is excess code. Boiling it down to just a couple of lines causes the same hang.
Using client As WebClient = New WebClient()
    Dim data As String = client.DownloadString("https://css.torontohydro.com")
End Using
And this is the HttpWebRequest code, in a nutshell, which also hangs getting the response.
Dim getRequest As HttpWebRequest = CreateWebRequest("https://css.torontohydro.com/")
getRequest.CachePolicy = New Cache.RequestCachePolicy(Cache.RequestCacheLevel.BypassCache)

Using webResponse As HttpWebResponse = CType(getRequest.GetResponse(), HttpWebResponse)
    ' no need for any more code, since the above line is where things hang
End Using
So this ended up being due to the project still targeting .NET 3.5, which by default negotiates HTTPS connections with SSL 3.0 / TLS 1.0, while this server evidently requires TLS 1.2. That would also explain the Fiddler behavior: Fiddler negotiates its own TLS session to the server on the application's behalf, masking the problem. Adding this line fixed it:
ServicePointManager.SecurityProtocol = 3072
I had to use the numeric value 3072 because .NET 3.5 does not contain a definition for SecurityProtocolType.Tls12.
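Putting it together, a minimal working version of the hanging snippet looks like this (a sketch; the CType cast is only needed if Option Strict is on):

' Force TLS 1.2 before the first request; 3072 is the value of SecurityProtocolType.Tls12
ServicePointManager.SecurityProtocol = CType(3072, SecurityProtocolType)

Using client As WebClient = New WebClient()
    Dim data As String = client.DownloadString("https://css.torontohydro.com")
End Using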
I have a function that pulls and formats the source code of pages using the VB.NET WebClient. I need a way to pull the source code of a page as though I were signed in in a browser.
I understand that I could use HttpWebRequest in normal circumstances, but that doesn't even yield the normal page - just one saying the browser is out of date - even when I set a new user agent. I believe it is related to how the request identifies its browser in VB.
I have been trying to do this using POST requests with the WebClient, but that doesn't work either. Below is the closest I have got:
Dim url As String = "URL HERE"
Dim xDoc As New XmlDocument
Dim s As String

Using client As New Net.WebClient
    Dim reqparm As New Specialized.NameValueCollection
    reqparm.Add("email", "EMAIL HERE")
    reqparm.Add("pass", "PASSWORD HERE")
    client.Headers("user-agent") = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    Dim responseBytes = client.UploadValues(url, "POST", reqparm)
    Dim responsebody = (New Text.UTF8Encoding).GetString(responseBytes)
    s = responsebody
End Using
I then pass the result to the next section of the program.
The attempt above just returns the normal source code. I'm guessing I'm completely missing how this works and how to implement it.
TL;DR:
I need to use the VB.NET WebClient to pull the source code of a page while acting as though it's signed in, but HttpWebRequest is not an option.
Any help would be greatly appreciated.
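Worth noting: WebClient does not retain cookies between requests, so any session cookie returned by the login POST is dropped before the next download. A common workaround (sketched here on the assumption that the site uses ordinary cookie-based sessions) is a WebClient subclass that carries a CookieContainer across calls:

' Sketch: a WebClient that keeps cookies across requests (assumes a cookie-based login)
Public Class CookieAwareWebClient
    Inherits Net.WebClient

    Private ReadOnly jar As New Net.CookieContainer()

    Protected Overrides Function GetWebRequest(address As Uri) As Net.WebRequest
        Dim request = MyBase.GetWebRequest(address)
        Dim httpRequest = TryCast(request, Net.HttpWebRequest)
        If httpRequest IsNot Nothing Then
            ' Attach the shared container so Set-Cookie values persist to later calls
            httpRequest.CookieContainer = jar
        End If
        Return request
    End Function
End Class

Used in place of New Net.WebClient above, the UploadValues login and any subsequent DownloadString calls on the same instance then share the session cookies.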
I'm writing an HTTP module to work as a reverse proxy, i.e. it receives a request from a browser, sends it on to the target site, receives the response, and sends that back to the browser.
It's all working fine, except for a problem with forwarding cookies from the browser request to the target site on a POST. All headers and form data are correct on the outgoing request, but no cookies are included.
I've run Fiddler on both the request from the browser to IIS and the outgoing HttpWebRequest, and proven this to be the case. Running the module in debug shows that the cookies are found in the request from the browser and successfully placed in the CookieContainer of the HttpWebRequest, but they just don't appear in the actual request sent out.
If I hack (in debug) the outgoing request method to a GET, then they go; but they don't go for a POST.
I've also tracked the request/response from a browser direct to the target site using Fiddler, and the request seems identical in all three cases (browser to target, browser to my IIS module, IIS module to target), except that the IIS module to target omits the cookies.
Here's the code (VB.NET, tried under both .NET 2.0 and 4.5):
' set up the request to the target
Dim reqTarget As System.Net.HttpWebRequest
reqTarget = CType(System.Net.HttpWebRequest.Create(strTargetURL & strTargetPath & qstring), System.Net.HttpWebRequest)

' copy relevant info, cookies etc. from the application request to the target request
CopyAppRequest(application.Context.Request, reqTarget)

' send the request and get the response
Dim rspTarget As System.Net.HttpWebResponse = CType(reqTarget.GetResponse(), System.Net.HttpWebResponse)
Private Sub CopyAppRequest(ByRef reqApp As System.Web.HttpRequest, ByRef reqTarget As System.Net.HttpWebRequest)
    ' copy over the headers
    For Each key As String In reqApp.Headers.AllKeys
        Select Case key
            Case "Host", "Connection", "Content-Length", "Accept-Encoding", "Expect", "Authorization", "If-Modified-Since"
                ' not sure if we need to process these; they are restricted headers
                ' that HttpWebRequest manages itself (Connection in particular
                ' cannot be assigned "Keep-Alive" directly)
            Case "Content-Type"
                reqTarget.ContentType = reqApp.Headers(key)
            Case "Accept"
                reqTarget.Accept = reqApp.Headers(key)
            Case "Referer"
                reqTarget.Referer = reqApp.Headers(key)
            Case "User-Agent"
                reqTarget.UserAgent = reqApp.Headers(key)
            Case "Cookie"
                ' do nothing, cookies are handled below..
            Case Else
                reqTarget.Headers.Add(key, reqApp.Headers(key))
        End Select
    Next

    reqTarget.Method = reqApp.HttpMethod
    reqTarget.AllowAutoRedirect = False

    If reqTarget.Method = "POST" Then
        Dim datastream() As Byte = System.Text.Encoding.UTF8.GetBytes(reqApp.Form.ToString)
        reqTarget.ContentLength = datastream.Length
        Dim requestwriter As System.IO.Stream = reqTarget.GetRequestStream
        requestwriter.Write(datastream, 0, datastream.Length)
        requestwriter.Close()
        requestwriter.Dispose()
    End If

    Dim CookieJar As New System.Net.CookieContainer
    reqTarget.CookieContainer = CookieJar
    For Each key As String In reqApp.Cookies.AllKeys
        Dim tgtCookie As New System.Net.Cookie
        With tgtCookie
            .Name = reqApp.Cookies.Item(key).Name
            .Value = reqApp.Cookies.Item(key).Value
            .Domain = ".domain.com"
            .Path = "/"
            .Expires = DateAdd(DateInterval.Month, 1, System.DateTime.Now)
            .HttpOnly = True
        End With
        CookieJar.Add(tgtCookie)
    Next
End Sub
Note: the domain I'm trying to reach is of the form abc.domain.com (i.e. it's a subdomain, with no www). The reason I've tried the .domain.com form is that it's the form used in the cookies received in the response. I've also tried other combinations, such as abc.domain.com, .abc.domain.com, etc. I've also tried creating a Uri object and using that overload to add the cookie to the CookieContainer.
I've tried everything I can think of and everything I can find on forums... Has anyone got any suggestions? I suspect I've missed something obvious!
Of course, any other comments on how the code above can be improved will be appreciated.
Thanks.
OK, I found the issue...
I was using Fiddler to see what was going on with HTTP, and Fiddler looked correct. However, when I used Wireshark, I found that the HTTP request was being sent earlier than I thought, and only on the POST.
It turned out that the requestwriter.Write line caused the HTTP request headers to go out on the wire, not GetResponse (as is the case with a GET). So anything I changed on the HttpWebRequest after requestwriter.Write didn't get sent.
The fix: I just moved all the header and cookie setup above the requestwriter.Write call, and it all worked.
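In other words, the corrected flow inside CopyAppRequest is roughly this (a condensed sketch of the fix described above, not the full routine):

' Set every header and the (already populated) CookieContainer BEFORE touching the
' request stream, because writing the body puts the headers on the wire.
reqTarget.Method = reqApp.HttpMethod
reqTarget.AllowAutoRedirect = False
reqTarget.CookieContainer = CookieJar ' cookies attached first

If reqTarget.Method = "POST" Then
    Dim datastream() As Byte = System.Text.Encoding.UTF8.GetBytes(reqApp.Form.ToString)
    reqTarget.ContentLength = datastream.Length
    Using requestwriter As System.IO.Stream = reqTarget.GetRequestStream()
        requestwriter.Write(datastream, 0, datastream.Length) ' body written last
    End Using
End If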
How frustrating, but at least it's fixed now :)
If anyone has any feedback on whether I've got something wrong that's causing this to happen, please let me know.