HtmlAgilityPack returning random characters - httpwebrequest

I have the following code that is using HtmlAgilityPack to pull back html code for a number of websites. All seems to be working well, apart from asos.com. When running a url through, it returns random characters (‹\b\0\0\0\0\0\0UÍ „ï&¾CãÁ¢ø›\bãhìÁ3-«Ziý}z‘š/»ómf³Ü`]In#iÉÑbr[œ¡Ä¬v7Ðœ¶7N[GáôSv;Ü°?[†.ã*3Ž¢G×ù6OƒäwPŒõH\rÙ¸\vzìmèÎ;M›4q_K¨Ð)
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.OptionReadEncoding = false;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://www.asos.com/ASOS/ASOS-Sweatshirt-With-Contrast-Ribs/Prod/pgeproduct.aspx?iid=2765751&cid=14368&sh=0&pge=0&pgesize=20&sort=-1&clr=Red");
request.Timeout = 10000;
request.ReadWriteTimeout = 32000;
request.UserAgent = "TEST";
request.Method = "GET";
request.Accept = "text/html";
request.AllowAutoRedirect = false;
request.CookieContainer = new CookieContainer();
StreamReader reader = new StreamReader(request.GetResponse().GetResponseStream(), Encoding.Default); //put your encoding
doc.Load(reader);
string html = doc.DocumentNode.OuterHtml;
I have run the URL through Fiddler, but I can't see anything to suggest there should be a problem. Any ideas where I'm going wrong?
See header image from fiddler here: http://i.stack.imgur.com/2LRFY.png

This has nothing to do with Html Agility Pack; it's because you have set AllowAutoRedirect to false. Remove that line and it will work. The site apparently does a redirect, and you need to follow it if you want the final HTML text.
Note that the Html Agility Pack has a utility HtmlWeb class that can download a page directly as an HtmlDocument:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(@"http://www.asos.com/ASOS/ASOS-Sweatshirt-With-Contrast-Ribs/Prod/pgeproduct.aspx?iid=2765751&cid=14368&sh=0&pge=0&pgesize=20&sort=-1&clr=Red");
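To see what switching off redirect following does to a response, here is a minimal Python sketch. It uses Python's urllib against a throwaway local server, not .NET, and the /old and /new paths, the handler classes and the fetch helper are all made up for illustration. With redirects disabled the client gets the 3xx response itself; with redirects enabled it gets the final HTML.

```python
import http.server
import threading
import urllib.error
import urllib.request


class RedirectingHandler(http.server.BaseHTTPRequestHandler):
    """Local stand-in for a site that redirects /old to /new (hypothetical paths)."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"<html><body>final page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Plays the role of AllowAutoRedirect = false: never follow a 3xx."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then reports the 3xx as an HTTPError


def fetch(url, follow_redirects):
    # An empty ProxyHandler keeps the local request away from any system proxy.
    handlers = [urllib.request.ProxyHandler({})]
    if not follow_redirects:
        handlers.append(NoRedirect)
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, error.read()


def demo():
    server = http.server.HTTPServer(("127.0.0.1", 0), RedirectingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/old"
        return fetch(url, follow_redirects=False), fetch(url, follow_redirects=True)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the two (status, body) pairs, so the difference between stopping at the redirect and following it can be compared side by side.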

Related

How to read a div's details, with or without scraping the web page, when they are not present in the page source, in Java?

I have a use case where I want to read the version of a published extension on the Edge store.
The link to a published extension looks like this: https://microsoftedge.microsoft.com/addons/detail/incognito-adblocker/efpgcmfgkpmogadebodiegjleafcmdcb
The problem I am facing is that the span containing the version (span ID "versionLabel") sits inside a parent div called "root". If we inspect the page, we can see all the child divs of this "root" div. But if we view the page source (Ctrl + U), this div always shows up empty:
<div id="root" style="min-height: 100vh"></div>
I am using Jsoup to parse this page and get these details, but because the "root" div is empty, I am not able to read the "versionLabel" details. Is there any way to do this?
Here are the approaches I have already tried; none of them worked.
1.
String URL = "https://microsoftedge.microsoft.com/addons/detail/incognito-adblocker/efpgcmfgkpmogadebodiegjleafcmdcb";
Document doc = Jsoup.connect(URL).get();
Element version = doc.getElementById("versionLabel");
2.
Document demo = Jsoup.parse(URL);
Element newHere = demo.getElementById("versionLabel");
3.
WebDriver driver = new ChromeDriver();
driver.get(URL);
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
WebElement e = driver.findElement(By.xpath("//*[text()='Get started free']"));
System.out.println(e);
4.
String webpage = "https://microsoftedge.microsoft.com/addons/detail/incognito-adblocker/efpgcmfgkpmogadebodiegjleafcmdcb";
URL url = new URL(webpage);
BufferedReader readr =
    new BufferedReader(new InputStreamReader(url.openStream()));
// Enter filename in which you want to download
BufferedWriter writer =
    new BufferedWriter(new FileWriter("Download.html"));
// read each line from stream till end
String line;
while ((line = readr.readLine()) != null) {
    writer.write(line);
}
readr.close();
writer.close();
In each of these ways, because the "root" div itself is empty, I am not able to read the "versionLabel" span.
Can someone suggest a way to do this?
This will get the version from the 'versionLabel':
driver.find_element(By.XPATH, "(//span[@id='versionLabel'])[2]").text
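A small Python sketch of why the Jsoup and raw-download attempts find nothing while a real browser does: the markup the server sends contains only the empty "root" div, and the span appears only after the page's scripts have run. The RENDERED_DOM string below is a made-up stand-in for what a browser might show after rendering, not the store's real markup, and IdFinder and has_element_with_id are hypothetical helper names.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element_with_id(html, element_id):
    finder = IdFinder(element_id)
    finder.feed(html)
    finder.close()
    return finder.found


# What "view source" (and therefore a plain HTTP fetch) returns, per the question.
STATIC_SOURCE = '<html><body><div id="root" style="min-height: 100vh"></div></body></html>'

# Hypothetical DOM after the page's JavaScript has populated the root div.
RENDERED_DOM = (
    '<html><body><div id="root" style="min-height: 100vh">'
    '<span id="versionLabel">Version 1.0</span></div></body></html>'
)

if __name__ == "__main__":
    print(has_element_with_id(STATIC_SOURCE, "versionLabel"))
    print(has_element_with_id(RENDERED_DOM, "versionLabel"))
```

Any parser that works on the fetched text alone sees only the first string, which is why a tool that executes the page's scripts, such as Selenium, is needed to reach the span.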

VB.Net Webview2 How can I get html source code?

I can successfully display a web site in WebView2 in my VB.NET (Visual Studio 2017) project, but I cannot get the HTML source code. Please advise me how to get the HTML.
My code:
Private Sub testbtn_Click(sender As Object, e As EventArgs) Handles testbtn.Click
WebView2.CoreWebView2.Navigate("https://www.microsoft.com/")
End Sub
Private Sub WebView2_NavigationCompleted(sender As Object, e As CoreWebView2NavigationCompletedEventArgs) Handles WebView2.NavigationCompleted
Dim html As String = ?????
End Sub
Thank you in advance for your advice.
I've only just started messing with the WebView2 earlier today as well, and was just looking for this same thing. I did manage to scrape together this solution:
Dim html As String
html = Await WebView2.ExecuteScriptAsync("document.documentElement.outerHTML;")
' The Html comes back with unicode character codes, other escaped characters, and
' wrapped in double quotes, so I'm using this code to clean it up for what I'm doing.
html = Regex.Unescape(html)
html = html.Remove(0, 1)
html = html.Remove(html.Length - 1, 1)
I converted my code from C# to VB on the fly, so hopefully I didn't introduce any syntax errors.
Adding to @Xaviorq8's answer, you can use a Span to avoid the intermediate strings that the two Remove calls create (C#):
html = Regex.Unescape(html);
html = html.AsSpan()[1..^1].ToString();
I must credit @Xaviorq8; his answer was needed to solve my problem.
I was successfully using .NET WebBrowser and Html Agility Pack but I wanted to replace WebBrowser with .NET WebView2.
Snippet (working code with WebBrowser):
using HAP = HtmlAgilityPack;
HAP.HtmlDocument hapHtmlDocument = null;
hapHtmlDocument = new HAP.HtmlDocument();
hapHtmlDocument.Load(webBrowser1.DocumentStream);
HtmlNodeCollection nodes = hapHtmlDocument.DocumentNode.SelectNodes("//*[@id=\"apptAndReportsTbl\"]");
Snippet (failing code with WebView2):
using HAP = HtmlAgilityPack;
HAP.HtmlDocument hapHtmlDocument = null;
string html = await webView21.ExecuteScriptAsync("document.documentElement.outerHTML");
hapHtmlDocument = new HAP.HtmlDocument();
hapHtmlDocument.LoadHtml(html);
HtmlNodeCollection nodes = hapHtmlDocument.DocumentNode.SelectNodes("//*[@id=\"apptAndReportsTbl\"]");
Success with WebView2 and Html Agility Pack:
using HAP = HtmlAgilityPack;
HAP.HtmlDocument hapHtmlDocument = null;
string html = await webView21.ExecuteScriptAsync("document.documentElement.outerHTML");
// thanks to @Xaviorq8's answer (next 3 lines)
html = Regex.Unescape(html);
html = html.Remove(0, 1);
html = html.Remove(html.Length - 1, 1);
hapHtmlDocument = new HAP.HtmlDocument();
hapHtmlDocument.LoadHtml(html);
HtmlNodeCollection nodes = hapHtmlDocument.DocumentNode.SelectNodes("//*[@id=\"apptAndReportsTbl\"]");
The accepted answer is on the right track. However, it's missing one important thing:
The returned string is NOT HTMLEncoded, it's JSON!
So to do it right, you need to deserialize the JSON, which is just as simple:
Dim html As String
html = Await WebView2.ExecuteScriptAsync("document.documentElement.outerHTML;")
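' ExecuteScriptAsync returns a JSON string literal, so deserialize it from the string
' (System.Text.Json). DeserializeAsync expects a Stream, not a String.
html = JsonSerializer.Deserialize(Of String)(html)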
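The same point can be shown in isolation with a short Python sketch: the script result is a JSON string literal, so one JSON decode removes the surrounding quotes and resolves every escape in a single step. The sample value below is a made-up stand-in for what a script-evaluation call might return, and decode_script_result is a hypothetical helper name.

```python
import json


def decode_script_result(raw):
    """Decode a JSON string literal such as the one a script-evaluation call returns."""
    value = json.loads(raw)
    if not isinstance(value, str):
        raise TypeError("expected the script result to be a JSON string")
    return value


# Hypothetical raw result: wrapped in double quotes, with \u003C for "<",
# escaped inner quotes and an escaped newline.
RAW_RESULT = r'"\u003Chtml>\u003Cbody>\"quoted\"\n\u003C/body>\u003C/html>"'

if __name__ == "__main__":
    print(decode_script_result(RAW_RESULT))
```

Decoding as JSON avoids trimming the quotes by hand and does not depend on regex unescaping happening to agree with JSON's escape rules.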

Is it possible to get that captcha image?

I am wondering if it is possible to obtain a captcha image from a site using a web request. The thing is that this captcha image changes every time I refresh the page, for example:
http://zapytaj.onet.pl/captcha/obrazek.jpg?6165303
As you can see, if you refresh the page the URL stays the same, but the code in the image changes.
The captcha is implemented on the page like this:
<img id="captcha-image" src="/captcha/obrazek.jpg?5842454"/>
<div>
Complete captcha (Refresh)
<input type="text" name="captcha_code" class="text">
The URL of the captcha also changes every time I refresh the registration page. So even if I get the captcha URL it will not work, because the image is randomly generated.
I managed something like this:
Dim dt1970 As DateTime = New DateTime(1970, 1, 1)
Dim current As Date = DateTime.Now
Dim span As TimeSpan = current - dt1970
Dim requestUriString As String = "http://zapytaj.onet.pl/captcha/obrazek.jpg?" + (Convert.ToInt64(DateTime.Now) / span.TotalMilliseconds).ToString() + ""
Dim httpWebRequest As HttpWebRequest = CType(WebRequest.Create(requestUriString), HttpWebRequest)
httpWebRequest.CookieContainer = Me.cookieJar(4)
Dim httpWebResponse As HttpWebResponse = CType(httpWebRequest.GetResponse(), HttpWebResponse)
Dim image As Image = Image.FromStream(httpWebResponse.GetResponseStream())
httpWebResponse.Close()
Me.PictureBox1.SizeMode = PictureBoxSizeMode.StretchImage
Me.PictureBox1.Image = image
I should probably also add a bitwise AND operation, but I don't think that would work anyway, likely because the DateTime values in my app and on the website differ.
If I use only http://zapytaj.onet.pl/captcha/obrazek.jpg as requestUriString I do get a captcha image, but, as I said, it's not the one shown on the page.
The page with the captcha image:
http://zapytaj.onet.pl/register.html#register-email-form
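If the number after the "?" is meant to be a millisecond Unix timestamp used as a cache-busting parameter, it could be computed as in the Python sketch below. That is an assumption: the values shown on the page (5842454, 6165303) are not obviously timestamps, and the function names here are made up for illustration.

```python
import time


def unix_millis(now=None):
    """Milliseconds since 1970-01-01 UTC; `now` (seconds) can be injected for testing."""
    seconds = time.time() if now is None else now
    return int(seconds * 1000)


def cache_busted_url(base_url, now=None):
    """Append the millisecond timestamp as a bare query string."""
    return f"{base_url}?{unix_millis(now)}"


if __name__ == "__main__":
    print(cache_busted_url("http://zapytaj.onet.pl/captcha/obrazek.jpg"))
```

A parameter like this only stops caching. Whether the image matches the one on the registration page probably depends on sending the same session cookies as the page request, which is also an assumption about how this site works.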

phantomjs to render a page from a string

I want to render a webpage from a string. I've looked at the docs of phantomjs and they suggested the following:
var webPage = require('webpage');
var page = webPage.create();
var expectedContent = '<html><body><div>Test div</div></body></html>';
var expectedLocation = 'http://www.phantomjs.org/';
page.setContent(expectedContent, expectedLocation);
It's not quite working. Why? (I use the latest version).
I suggest you load a normal page first (about:blank works) and then set page.content = '<html><body><div>Test div</div></body></html>'; on the page object (not on the webpage module),
then render your page.
Hope that helps.

Contacting another website via program?

I know the title isn't very elaborate, but I have tried this multiple times (to figure out how) but I could never find out how to do so. I want to do stuff like upload a "paste" to pastebin.com, upload a picture to twitpic.com, upload a file to rapidshare.com, etcetera.
How would I do so? Thanks!
(Visual Basic 2010 Express | Windows 7 Ultimate)
I'm sure Visual Basic 2010 Express has some way to interact with a server.
If you can't find one, you may need to switch languages.
To post to Twitpic you need to use their API, described at the following URL:
http://twitpic.com/api.do
For example:
<form action="http://twitpic.com/api/uploadAndPost">
<input name="media"></input>
<input name="username"></input>
<input name="password"></input>
<input name="message"></input>
</form>
It depends. You can just do a cross-domain form submission (set the action to a page on another domain), or you can do server-server communication, or you can use JSONP (JSON wrapped in a function call).
VB.NET code for a Pastebin submission is:
Dim req As HttpWebRequest = DirectCast(WebRequest.Create("http://pastebin.com/api_public.php"), HttpWebRequest)
req.ContentType = "application/x-www-form-urlencoded"
req.Method = "POST"
Dim postData As String = "paste_code=Simple Example"
Dim postBytes As Byte() = Encoding.UTF8.GetBytes(postData)
req.ContentLength = postBytes.Length
Dim reqStream As Stream = req.GetRequestStream()
reqStream.Write(postBytes, 0, postBytes.Length)
reqStream.Close()
Dim resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
Dim respText As String = New StreamReader(resp.GetResponseStream(), Encoding.UTF8).ReadToEnd()
respText is the generated Pastebin URL. This can obviously be improved; it's an initial demonstration.
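One improvement is to URL-encode the form fields rather than concatenating them by hand, so that spaces, "&" and "=" in the paste text cannot corrupt the body. A minimal Python sketch of building such a request is below. The endpoint and the paste_code field are copied from the answer above, the sketch does not check whether that endpoint still exists, and build_paste_request is a made-up helper name. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_paste_request(paste_text, endpoint="http://pastebin.com/api_public.php"):
    """Build (but do not send) a form-encoded POST carrying the paste text."""
    body = urlencode({"paste_code": paste_text}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    request = build_paste_request("Simple Example & more")
    print(request.get_method(), request.full_url)
    print(request.data)
```

Passing the request to urllib.request.urlopen would send it; the response body would then play the role of respText in the VB.NET version.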