Web scraping using VB.NET - vb.net

I have a url:
test.com/Search/NumberSearch.aspx
On the page there are a number of controls, one of them a textbox. When the user enters a number (approximately six digits) into the textbox and hits Enter, the browser goes to another page:
test.com/Data/DetailsPage.aspx?mynum=123456
On that page there are a number of textboxes and other controls from which I need to scrape the data, in addition to a number of links that I need to capture in my code.
I have tried using VB.NET WebRequest:
Dim wreq As WebRequest = WebRequest.Create("test.com/Data/DetailsPage.aspx?mynum=" & num)
Dim wresp As HttpWebResponse = CType(wreq.GetResponse(), HttpWebResponse)
Dim dStream As Stream = wresp.GetResponseStream()
Dim rdr As New StreamReader(dStream)
Dim respStr As String = rdr.ReadToEnd()
As a result my respStr contains a string with html code but that code is for
test.com/Search/NumberSearch.aspx
not for the resulting
test.com/Data/DetailsPage.aspx?mynum=123456
page with details.
My goal is to get the details page html programmatically.
I also tried using
WebClient.DownloadString
but got the same result. Can anyone help?

I would try setting the User-Agent header, as that is what many sites key off of:
Dim wreq As HttpWebRequest = CType(WebRequest.Create("test.com/Data/DetailsPage.aspx?mynum=" & num), HttpWebRequest)
'User-Agent is a restricted header on HttpWebRequest, so set it through the property rather than Headers.Add
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
Dim wresp As HttpWebResponse = CType(wreq.GetResponse(), HttpWebResponse)
Dim dStream As Stream = wresp.GetResponseStream()
Dim rdr As New StreamReader(dStream)
Dim respStr As String = rdr.ReadToEnd()

Related

Post Data for a site in httpwebrequest

I am trying to log in to a website using HttpWebRequest. I got the post data below from an HTTP debugger.
The code I am trying is:
Dim postData As String = "securitycheck=85b39cc89f04bc1612ce9d0c384b39ca&do_action=log_into_system&jump_to=https%3A%2F%2Fwww.dreamstime.com%2F&uname=jawademail&pass=jawadpass"
Dim tempCookies As New CookieContainer
Dim encoding As New UTF8Encoding
Dim byteData As Byte() = encoding.GetBytes(postData)
Dim postReq As HttpWebRequest = DirectCast(WebRequest.Create("https://www.dreamstime.com/securelogin.php"), HttpWebRequest)
postReq.Method = "POST"
postReq.KeepAlive = True
postReq.CookieContainer = tempCookies
postReq.ContentType = "application/x-www-form-urlencoded"
postReq.Referer = "https://www.dreamstime.com/login.php"
postReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)"
postReq.ContentLength = byteData.Length
Dim postreqstream As Stream = postReq.GetRequestStream()
postreqstream.Write(byteData, 0, byteData.Length)
postreqstream.Close()
Dim postresponse As HttpWebResponse
postresponse = DirectCast(postReq.GetResponse(), HttpWebResponse)
tempCookies.Add(postresponse.Cookies)
logincookie = tempCookies
Dim postreqreader As New StreamReader(postresponse.GetResponseStream())
Dim thepage As String = postreqreader.ReadToEnd
RichTextBox1.Text = thepage
This code does not seem to post the data to the website. After running it, I get the HTML of the referer (login) page in the RichTextBox.
First GET the page, find the "securitycheck" in its source and extract it.
Combine it with the rest of your data then send it with POST.
Ok so I felt like trying:
Dim LoginData As String
Dim LoginCookies As New CookieContainer() 'Move this outside of sub/function so you can use it later
Dim LoginRequest As HttpWebRequest = WebRequest.Create("https://www.dreamstime.com/login.php")
LoginRequest.CookieContainer = LoginCookies
LoginRequest.KeepAlive = True
LoginRequest.AllowAutoRedirect = True
LoginRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"
Dim LoginResponse As HttpWebResponse = LoginRequest.GetResponse()
Dim LoginResponseRead As StreamReader = New StreamReader(LoginResponse.GetResponseStream())
Using LoginResponseRead
Do
Dim line As String = LoginResponseRead.ReadLine
If line.Contains("var securitycheck=") Then
LoginData = "securitycheck=" & line.Substring(line.IndexOf("=") + 2, line.LastIndexOf("'") - line.IndexOf("=") - 2)
Exit Do
End If
Loop
End Using
Dim byteData As Byte() = Encoding.UTF8.GetBytes(LoginData)
LoginRequest = WebRequest.Create("https://www.dreamstime.com/securelogin.php")
LoginRequest.CookieContainer = LoginCookies
LoginRequest.Method = "POST"
LoginRequest.KeepAlive = True
LoginRequest.ContentType = "application/x-www-form-urlencoded"
LoginRequest.Referer = "https://www.dreamstime.com/login.php"
LoginRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)"
LoginRequest.ContentLength = byteData.Length
Dim postreqstream As Stream = LoginRequest.GetRequestStream()
postreqstream.Write(byteData, 0, byteData.Length)
postreqstream.Close()
LoginResponse = LoginRequest.GetResponse()
LoginResponseRead = New StreamReader(LoginResponse.GetResponseStream())
Dim thepage As String = LoginResponseRead.ReadToEnd
'Now with GET request grab whatever you want, DON'T forget to use cookie.
Result
>>>securitycheck=183d5abdb01f288aacbe5b2893555ec5
Dim email As String = "something"
Dim password As String = "somethingelse"
LoginData &= "&do_action=log_into_system&jump_to=https%3A%2F%2Fwww.dreamstime.com%2F&uname=" & email & "&pass=" & password
>>>securitycheck=183d5abdb01f288aacbe5b2893555ec5&do_action=log_into_system&jump_to=https%3A%2F%2Fwww.dreamstime.com%2F&uname=something&pass=somethingelse
There, practically done.
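One caveat with the concatenation above: an email address or password containing characters such as & or = would corrupt the form body. A minimal sketch of escaping each value first, assuming the field names shown in this thread (the module and function names are made up for illustration):

```vbnet
Imports System

Module PostDataSketch
    ' Builds an application/x-www-form-urlencoded body for the login POST.
    ' Each value is percent-escaped so reserved characters cannot break the body.
    Function BuildLoginData(token As String, email As String, password As String) As String
        Return "securitycheck=" & Uri.EscapeDataString(token) &
               "&do_action=log_into_system" &
               "&jump_to=" & Uri.EscapeDataString("https://www.dreamstime.com/") &
               "&uname=" & Uri.EscapeDataString(email) &
               "&pass=" & Uri.EscapeDataString(password)
    End Function
End Module
```

The returned string can then be converted with Encoding.UTF8.GetBytes and written to the request stream exactly as in the code above.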
I took a look at this myself and it appears that with every login request a token is sent that identifies your "session", specifically:
securitycheck=85b39cc89f04bc1612ce9d0c384b39ca
This token changes every time you log in, and if it isn't valid the site redirects you back to the login page, asking you to log in again.
Sites usually do this to prevent Cross-Site Request Forgery (CSRF). This means that you will most likely not be able to log in to this site without using an actual web browser.
Here's code that was tested and works. It uses System.Net.Http.HttpClient rather than WebClient (since it supports concurrent requests). This code is just a model, since its main goal is to show the idea of how to work with this site. There are additional explanations in the comments. You also need to add a reference to the System.Web assembly.
Imports System.Net.Http
Imports System.Web
Imports System.Text.RegularExpressions
Public Class TestForm
Private Const URL_MAIN$ = "https://www.dreamstime.com"
Private Const URL_LOGIN$ = "https://www.dreamstime.com/securelogin.php"
Private Const URL_LOGOUT$ = "https://www.dreamstime.com/logout.php"
Private Const USER_AGENT$ = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
"AppleWebKit/537.36 (KHTML, Like Gecko) " +
"Chrome/68.0.3440.15 Safari/537.36 OPR/55.0.2991.0 " +
"(Edition developer)"
Private Const LOGIN$ = "<USER_NAME>"
Private Const PASS$ = "<USER_PASSWORD>"
Private token$
Private Async Sub OnGo() Handles btnGo.Click
Dim html$
Using client = New HttpClient()
client.DefaultRequestHeaders.Add("User-Agent", USER_AGENT)
Using req = New HttpRequestMessage(HttpMethod.Get, URL_MAIN)
Using resp = Await client.SendAsync(req)
html = Await resp.Content.ReadAsStringAsync()
End Using
End Using
'// Search for security token
Dim m = Regex.Match(
html,
"<input type=""hidden"" name=""securitycheck"" value=""(?'token'\w+)"">")
If Not m.Success Then
MessageBox.Show("Could not find security token.")
Return
End If
'// Get security token
token = m.Groups("token").Value
'// Try to log in.
'// For the login to work, we need to use the FormUrlEncodedContent class.
'// We also need to use it every time we do POST requests.
'// It is not needed for GET requests (as long as the HttpClient is the same).
Using req = New HttpRequestMessage(HttpMethod.Post, URL_LOGIN) With
{
.Content = GetFormData()
}
Using resp = Await client.SendAsync(req)
html = Await resp.Content.ReadAsStringAsync()
End Using
End Using
'// Go to main page to check we're logged in.
'// "html" variable now MUST contain user's account name.
Using req = New HttpRequestMessage(HttpMethod.Get, URL_MAIN$)
Using resp = Await client.SendAsync(req)
html = Await resp.Content.ReadAsStringAsync()
End Using
End Using
'// Logout.
'// "html" variable now MUST NOT contain user's account name.
Using req = New HttpRequestMessage(HttpMethod.Get, URL_LOGOUT)
Using resp = Await client.SendAsync(req)
html = Await resp.Content.ReadAsStringAsync()
End Using
End Using
End Using
End Sub
'// FormUrlEncodedContent URL-encodes each value itself, so the raw strings are passed;
'// HTML-encoding them first would alter characters such as & or <.
Function GetFormData() As FormUrlEncodedContent
Return New FormUrlEncodedContent(New Dictionary(Of String, String) From
{
{"securitycheck", token},
{"do_action", "log_into_system"},
{"jump_to", ""},
{"uname", LOGIN},
{"pass", PASS}
})
End Function
End Class

Web scraping with a pop-up (Visual Basic)

I'm trying to scrape a page, but when I log in, the site displays a pop-up before the page that I need (Welcome to blah-blah-blah...don't hit refresh as it will slow the process...etc.etc...).
Naturally, HttpWebRequest scrapes THIS data and not the page that follows.
The pop-up dismisses itself, so if I could just get the HttpWebRequest to wait a second or two and then scrape, it would work. Alternatively, if I can get it to do two scrapes in the same session (and simply discard the first one), that would work too.
Here's the code:
Dim CookieJar As New CookieContainer
Dim Request As HttpWebRequest = WebRequest.CreateHttp(TextBox1.Text)
Request.CookieContainer = CookieJar
Request.CookieContainer.Add(New Uri(TextBox1.Text),
New Cookie("id", "1234"))
Request.PreAuthenticate = True
Request.Credentials = CredentialCache.DefaultCredentials
Request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64)" 'The property takes only the value, without the "User-Agent:" prefix
Request.AllowAutoRedirect = True
Request.MaximumAutomaticRedirections = 4
Request.MaximumResponseHeadersLength = 4
Dim Response As WebResponse = DirectCast(Request.GetResponse(), HttpWebResponse)
Dim WebResult As String = New StreamReader(Response.GetResponseStream()).ReadToEnd()
TextBox2.Text = WebResult
Thanks in advance for any suggestions.
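The "two scrapes in the same session" idea can be sketched roughly as follows. This is only an assumption-laden outline: it presumes the welcome page is tied to a server-side session cookie, so that a second request with the same CookieContainer returns the real page. If the pop-up is produced by client-side script, this will not help. The names Fetch and FetchAfterInterstitial are made up for illustration.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Threading

Module TwoPassSketch
    ' Downloads a page. Passing the same CookieContainer to consecutive
    ' calls keeps them in the same session.
    Function Fetch(url As String, jar As CookieContainer) As String
        Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        req.CookieContainer = jar
        req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
        Using resp As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(resp.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    ' Requests the page twice with a shared cookie container and a pause in
    ' between, discarding the first response (the assumed welcome page).
    Function FetchAfterInterstitial(url As String, delayMs As Integer) As String
        Dim jar As New CookieContainer()
        Fetch(url, jar)          ' first response is thrown away
        Thread.Sleep(delayMs)    ' blocks the calling thread; avoid on a UI thread
        Return Fetch(url, jar)
    End Function
End Module
```

For example, TextBox2.Text = FetchAfterInterstitial(TextBox1.Text, 2000) would wait two seconds between the two requests.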

GetRequestStream causes The connection was closed unexpectedly

I have a VB.NET console application that logged in to a website (POST form) by using WebClient:
Dim responsebytes = myWebClient.UploadValues("https:!!xxx.com/mysession/create", "POST", myNameValueCollection)
Last Friday this suddenly stopped working, after working without a problem for about 2-3 years. With Fiddler I got an HTTP 504 error, but without Fiddler I got the error message:
The underlying connection was closed: The connection was closed unexpectedly.
I assume that something on the server side has changed, but I have no influence on that. It's a commercial website where I want to log in to my account automatically to fetch some data.
As Fiddler can't help me much further, I decided to build a basic HttpWebRequest example to rule out that it was caused by the WebClient.
The example does the following:
navigate to the homepage of the company and read out a securityToken (this goes ok!)
post the securityToken + username + password to get logged in.
Public Class Form1
Const ConnectURL = "https:!!member.company.com/homepage/index"
Const LoginURL = "https:!!member.company.com/account/logn"
Private Function RegularPage(ByVal URL As String, ByVal CookieJar As CookieContainer) As String
Dim reader As StreamReader
Dim Request As HttpWebRequest = HttpWebRequest.Create(URL)
Request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
Request.AllowAutoRedirect = False
Request.CookieContainer = CookieJar
Dim Response As HttpWebResponse = Request.GetResponse()
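reader = New StreamReader(Response.GetResponseStream())
Dim html As String = reader.ReadToEnd()
'Close the reader and response before returning; code after Return never runs
reader.Close()
Response.Close()
Return html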
End Function
Private Function LogonPage(ByVal URL As String, ByRef CookieJar As CookieContainer, ByVal PostData As String) As String
Dim reader As StreamReader
Dim Request As HttpWebRequest = HttpWebRequest.Create(URL)
Request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
Request.CookieContainer = CookieJar
Request.AllowAutoRedirect = False
Request.ContentType = "application/x-www-form-urlencoded"
Request.Method = "POST"
Dim postBytes As Byte() = Encoding.ASCII.GetBytes(PostData)
Request.ContentLength = postBytes.Length 'Content-Length must be the byte count of the body
Dim requestStream As Stream = Request.GetRequestStream()
requestStream.Write(postBytes, 0, postBytes.Length)
requestStream.Close()
Dim Response As HttpWebResponse = Request.GetResponse()
For Each tempCookie In Response.Cookies
CookieJar.Add(tempCookie)
Next
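reader = New StreamReader(Response.GetResponseStream())
Dim html As String = reader.ReadToEnd()
'Close the reader and response before returning; code after Return never runs
reader.Close()
Response.Close()
Return html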
End Function
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim CookieJar As New CookieContainer
Dim PostData As String
Try
Dim homePage As String = (RegularPage(ConnectURL, CookieJar))
Dim securityToken = homePage.Substring(homePage.IndexOf("securityToken") + 22, 36) 'currently 36 characters long
PostData = "securityToken=" + securityToken + "&accountId=123456789&password=mypassword"
MsgBox(PostData)
Dim accountPage As String = (LogonPage(LoginURL, CookieJar, PostData))
Catch ex As Exception
MsgBox(ex.Message.ToString)
End Try
End Sub
End Class
This line causes the connection to be closed:
Dim requestStream As Stream = Request.GetRequestStream()
Is it possible that this company doesn't like the automated login and somehow notices that an application is used for logging in? How can I debug this? Fiddler doesn't seem to work. Is my only option Wireshark? That seems kind of difficult to me.
Also, isn't it weird that the connection is already closed before I do the POST?
Are there other languages in which I can program this "easily" to rule out that it's a VB.NET / .NET problem?
Have you attempted to capture the request using something like your browser's networking tools?
The auth process may have changed. Could even be some name or post data changes.
I got this fixed by:
double-checking all the headers that are sent when using a browser
making sure all those headers were sent by the VB.NET application.
Not sure which one did the trick, but always make sure you replicate all the headers that the browser would send!
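As a rough illustration of replicating browser headers on an HttpWebRequest, here is a minimal sketch. The header values are illustrative assumptions, not the ones this particular site requires; copy the real ones from the browser's network tools. The function name CreateBrowserLikeRequest is made up.

```vbnet
Imports System.Net

Module BrowserHeadersSketch
    ' Creates a request carrying headers that a desktop browser typically sends.
    ' All header values here are assumptions for illustration only.
    Function CreateBrowserLikeRequest(url As String, referer As String, jar As CookieContainer) As HttpWebRequest
        Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        req.CookieContainer = jar
        ' Restricted headers must be set through their dedicated properties.
        req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
        req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        req.Referer = referer
        ' Sends Accept-Encoding and decompresses the response transparently.
        req.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        ' Unrestricted headers can be added to the collection directly.
        req.Headers.Add("Accept-Language", "en-US,en;q=0.5")
        Return req
    End Function
End Module
```

Both RegularPage and LogonPage could start from a request built this way and then set Method, ContentType and the body as before.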

Website rejecting HttpWebRequest in Visual Basic

I tried to get the source of pages under www.digikey.com.
It worked fine for a long time, but now the website rejects the request. There is no problem in a web browser; the issue only happens when accessing the site via code. When I examine the page source I receive, it says:
There was a problem with your request.
We are unable to process your request.
Please return to the previous page to try again or contact Digi-Key Webmaster if you feel that you have received this message in error. Please reference the following incident number so we may assist you with this error.
The code I used was:
Dim PartURL As String
PartURL = "http://www.digikey.com"
Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create(PartURL)
Dim response As System.Net.HttpWebResponse = request.GetResponse()
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim sourcecode As String = sr.ReadToEnd()
When I changed the URL to google.com, it worked well.
The website www.digikey.com works without a problem in web browsers and shows the error only when accessed via the code, so I thought it may have something to do with the code. Is digikey rejecting the request? Is there any way I can get the source code of pages under www.digikey.com?
Try this:
Dim PartURL As String
PartURL = "http://www.digikey.com"
Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create(PartURL)
request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b8) Gecko/20100101 Firefox/4.0b8"
request.Accept = "text/html,application/xhtml+xml,application/xml"
Dim response As System.Net.HttpWebResponse = request.GetResponse()
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim sourcecode As String = sr.ReadToEnd()

HTTPWebRequest Login POST is not Redirecting

I need to use HTTPWebRequest to login to an external website and redirect me to the default page. My code below is behind a button - when clicked it currently tries to do some processing but stays on the same page. I need it to redirect me to the default page of the external website without seeing the login page. Any help on what I'm doing wrong?
Dim loginURL As String = "https://www.example.com/login.aspx"
Dim cookies As CookieContainer = New CookieContainer
Dim myRequest As HttpWebRequest = CType(WebRequest.Create(loginURL), HttpWebRequest)
myRequest.CookieContainer = cookies
myRequest.AllowAutoRedirect = True
myRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1"
Dim myResponse As HttpWebResponse = CType(myRequest.GetResponse(), HttpWebResponse)
Dim responseReader As StreamReader
responseReader = New StreamReader(myResponse.GetResponseStream())
Dim responseData As String = responseReader.ReadToEnd()
responseReader.Close()
'call a function to extract the viewstate needed to login
Dim ViewState As String = ExtractViewState(responseData)
Dim postData As String = String.Format("__VIEWSTATE={0}&txtUsername={1}&txtPassword={2}&btnLogin.x=27&btnLogin.y=9", ViewState, "username", "password")
Dim encoding As UTF8Encoding = New UTF8Encoding()
Dim data As Byte() = encoding.GetBytes(postData)
'POST to login page
Dim postRequest As HttpWebRequest = CType(WebRequest.Create(loginURL), HttpWebRequest)
postRequest.Method = "POST"
postRequest.AllowAutoRedirect = True
postRequest.ContentLength = data.Length
postRequest.CookieContainer = cookies
postRequest.ContentType = "application/x-www-form-urlencoded"
postRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1"
Dim newStream = postRequest.GetRequestStream()
newStream.Write(data, 0, data.Length)
newStream.Close()
Dim postResponse As HttpWebResponse = CType(postRequest.GetResponse(), HttpWebResponse)
'using GET request on default page
Dim getRequest As HttpWebRequest = CType(WebRequest.Create("https://www.example.com/default.aspx"), HttpWebRequest)
getRequest.CookieContainer = cookies
getRequest.AllowAutoRedirect = True
Dim getResponse As HttpWebResponse = CType(getRequest.GetResponse(), HttpWebResponse)
'returns statuscode = 200
FYI - when I add this code at the end, I get the HTML of the default page I'm trying to redirect to:
Dim responseReader1 As StreamReader
responseReader1 = New StreamReader(getRequest.GetResponse().GetResponseStream())
responseData = responseReader1.ReadToEnd()
responseReader1.Close()
Response.Write(responseData)
Any help on what's missing to get the redirect working?
Cheers
The HttpWebRequest only automatically redirects you if the server sends an HTTP 3xx redirection status with a Location field in the response. Otherwise you are supposed to manually navigate to the page by using Response.Redirect, for example. Also keep in mind that the automatic redirection IGNORES ANY COOKIES sent by the server. That may be the problem in your case if the server is actually sending a redirection status.
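If cookies across redirects are a concern, one option is to turn automatic redirection off and follow the Location header by hand, so the same CookieContainer is applied on every hop. A minimal sketch, with a made-up function name and an arbitrary hop limit:

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module RedirectSketch
    ' Issues a GET and, while the server answers with a 3xx status and a
    ' Location header, follows it manually using the shared cookie container.
    Function GetFollowingRedirects(url As String, jar As CookieContainer, Optional maxHops As Integer = 5) As String
        Dim current As New Uri(url)
        For hop As Integer = 0 To maxHops
            Dim req As HttpWebRequest = CType(WebRequest.Create(current), HttpWebRequest)
            req.CookieContainer = jar
            req.AllowAutoRedirect = False
            Using resp As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
                Dim code As Integer = CInt(resp.StatusCode)
                Dim location As String = resp.Headers("Location")
                If code >= 300 AndAlso code < 400 AndAlso Not String.IsNullOrEmpty(location) Then
                    ' Location may be relative; resolve it against the current URL.
                    current = New Uri(current, location)
                    Continue For
                End If
                Using reader As New StreamReader(resp.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        Next
        Throw New InvalidOperationException("Too many redirects.")
    End Function
End Module
```

Note that this only retrieves the HTML on the server side. Sending the user's own browser to the external site still requires something like Response.Redirect, and the cookies obtained here are not transferred to that browser.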