Scan hundreds of websites using HttpWebRequest or similar - vb.net

I have VB code using HttpWebRequest that collects the HTML of hundreds of websites, but it takes a very long time to complete. The code is basically a For loop that reads the HTML of each website in a ListBox. Inside the loop, the extracted HTML of each website is searched for specific words, and I want to display the list of websites that contain each word under that word's column.
For Each webAddr As String In lstbox.Items
    strHtml = Make_A_Call(webAddr)
    If strHtml.Contains("Keyword1") Then
        ..........
    End If
    If strHtml.Contains("Keyword2") Then
        ..........
    End If
    ..........
    ..........
    ..........
    ..........
    ..........
Next
Private Function Make_A_Call(ByVal strURL As String) As String
    Dim strResult As String = ""
    Try
        Dim wbrq As HttpWebRequest = DirectCast(WebRequest.Create(strURL), HttpWebRequest)
        wbrq.Method = "GET"
        ' Read the returned data; Using guarantees the response and reader
        ' are closed even if an exception is thrown.
        Using wbrs As HttpWebResponse = DirectCast(wbrq.GetResponse(), HttpWebResponse)
            Using sr As New StreamReader(wbrs.GetResponseStream())
                strResult = sr.ReadToEnd().Trim()
            End Using
        End Using
    Catch ex As Exception
        ErrMessage.Text = ex.Message
        ErrMessage.ForeColor = Color.Red
    End Try
    Return strResult
End Function
The compiled code takes almost 5 minutes to complete the loop, and sometimes it fails to finish. Can it be modified to improve the performance? Please help with better code and suggestions.

Remember, there are two separate bottlenecks:
Bandwidth to download the HTML
CPU processing
You can't necessarily speed up the downloading using parallel processing; that can only be helped by buying more bandwidth. What you can do, though, is ensure that the downloading and processing are done on separate threads. I'd suggest doing the following:
Use BackgroundWorker instances to download the data.
In the work-completed callback, first fire off the next BackgroundWorker, then process the result of the finished worker (the keyword search), as in the sketch below.
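Here is a minimal sketch of that pipeline, assuming the URLs come from the asker's lstbox; the Scanner class name, the workerCount value and the SearchKeywords routine are placeholders for illustration, not the asker's actual code.
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.ComponentModel
Imports System.IO
Imports System.Net

Public Class Scanner
    Private ReadOnly pendingUrls As New ConcurrentQueue(Of String)()

    Public Sub Start(urls As IEnumerable(Of String), workerCount As Integer)
        For Each url In urls
            pendingUrls.Enqueue(url)
        Next
        ' Raise the connection limit so the workers can actually run in parallel.
        ServicePointManager.DefaultConnectionLimit = workerCount
        For i = 1 To workerCount
            StartNextDownload()
        Next
    End Sub

    Private Sub StartNextDownload()
        Dim nextUrl As String = Nothing
        If Not pendingUrls.TryDequeue(nextUrl) Then Return

        Dim worker As New BackgroundWorker()
        AddHandler worker.DoWork,
            Sub(s, e)
                ' Thread-pool thread: download only, no UI access here.
                Dim request = DirectCast(WebRequest.Create(CStr(e.Argument)), HttpWebRequest)
                Using response = request.GetResponse()
                    Using reader As New StreamReader(response.GetResponseStream())
                        e.Result = reader.ReadToEnd()
                    End Using
                End Using
            End Sub
        AddHandler worker.RunWorkerCompleted,
            Sub(s, e)
                ' Back on the UI thread (assuming Start was called from it):
                ' fire off the next download first, then do the CPU-bound search.
                StartNextDownload()
                If e.Error Is Nothing Then
                    SearchKeywords(CStr(e.Result))
                End If
            End Sub
        worker.RunWorkerAsync(nextUrl)
    End Sub

    Private Sub SearchKeywords(html As String)
        ' Placeholder for the original Contains("Keyword1") / Contains("Keyword2") checks.
    End Sub
End Class
Calling Start with the ListBox URLs and a worker count of, say, 8 would keep roughly eight downloads in flight while the keyword search runs on the UI thread between completions.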

Related

Google searching URL and add Listbox C# vb.net

I want to use WebRequest and then add the Google search result URLs to ListBox1, but my code gives an error.
Try
    Dim adres As String = "https://www.google.com/search?q=" + TextBox1.Text
    Dim istek As WebRequest = HttpWebRequest.Create(adres)
    Dim cevap As WebResponse
    cevap = istek.GetResponse()
    Dim donenBilgiler As StreamReader = New StreamReader(cevap.GetResponseStream())
    Dim gelen As String = donenBilgiler.ReadToEnd()
    Dim titleIndexBaslangici As Integer = gelen.IndexOf("<link href") + 2
    Dim titleIndexBitisi As Integer = gelen.Substring(titleIndexBaslangici).IndexOf(">")
    ListBox1.Items.Add(gelen.Substring(titleIndexBaslangici, titleIndexBitisi))
Catch ex As Exception
    MsgBox(ex.Message)
End Try
First, are you familiar with APIs? Because that's what you need here. You might get this working the way you want, but it's really fragile, and no one would recommend continuing like this.
What you should look for is an API (a Google search API). An API's whole purpose is to expose data over simple, well-documented HTTP routes; try it out.
If you keep doing it your own way, the best result you will get is a messy HTML page that you will have to parse yourself, and you don't want that.
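As a rough illustration of the API approach, here is a minimal sketch. The endpoint URL and the key/q parameters are hypothetical placeholders, not the real Google API; look up the actual route, parameters and quota rules in the API's documentation.
Imports System.IO
Imports System.Net

Module ApiCall
    Sub SearchViaApi(query As String)
        ' Hypothetical JSON endpoint and key; replace with the documented API route.
        Dim url As String = "https://api.example.com/search?key=YOUR_KEY&q=" &
                            Uri.EscapeDataString(query)
        Dim request = DirectCast(WebRequest.Create(url), HttpWebRequest)
        Using response = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                ' A documented API returns structured JSON instead of an HTML page,
                ' so the result can be deserialized rather than screen-scraped.
                Dim json As String = reader.ReadToEnd()
                Console.WriteLine(json)
            End Using
        End Using
    End Sub
End Module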

non blocking webrequests vb.net

I am making a program that must process about 5000 strings as quickly as possible. About 2000 of these strings must be translated via a web request to mymemory.translated.net (see the code below; the JSON part is removed since it is not needed here).
Try
    url = "http://api.mymemory.translated.net/get?q=" & Firstpart & "!&langpair=de|it&de=somemail#christmas.com"
    request = DirectCast(WebRequest.Create(url), HttpWebRequest)
    response = DirectCast(request.GetResponse(), HttpWebResponse)
    myreader = New StreamReader(response.GetResponseStream())
    Dim rawresp As String
    rawresp = myreader.ReadToEnd()
    Debug.WriteLine("Raw:" & rawresp)
Catch ex As Exception
    MessageBox.Show(ex.ToString)
End Try
The code itself works fine; the problem is that it's blocking and needs about 1 second per string. That's more than half an hour for all my strings. I would need to convert this code to a non-blocking one and make multiple calls at the same time. Could somebody please tell me how I could do that? I was thinking of a BackgroundWorker, but that wouldn't speed things up; it would just execute the code on a different thread.
Thanks!
The problem is that you aren't just being held back by the maximum number of concurrent operations. HttpWebRequest is throttled by nature (I believe the default policy allows only 2 concurrent connections per host), so you have to override that behaviour too. Please refer to the code below.
Imports System.Diagnostics
Imports System.IO
Imports System.Net
Imports System.Threading
Imports System.Threading.Tasks

Public Class Form1

    ''' <summary>
    ''' Test entry point.
    ''' </summary>
    Private Sub Form1_Load() Handles MyBase.Load
        ' Generate enough words for us to test throughput.
        Dim words = Enumerable.Range(1, 100) _
            .Select(Function(i) "Word" + i.ToString()) _
            .ToArray()

        ' Maximum theoretical number of concurrent requests.
        Dim maxDegreeOfParallelism = 24

        Dim sw = Stopwatch.StartNew()

        ' Capture information regarding the current SynchronizationContext
        ' so that we can perform thread marshalling later on.
        Dim uiScheduler = TaskScheduler.FromCurrentSynchronizationContext()
        Dim uiFactory = New TaskFactory(uiScheduler)

        Dim transformTask = Task.Factory.StartNew(
            Sub()
                ' Apply the transformation in parallel.
                ' Parallel.ForEach implements clever load
                ' balancing, so, since each request won't
                ' be doing much CPU work, it will spawn
                ' many parallel streams - likely more than
                ' the number of CPUs available.
                Parallel.ForEach(words, New ParallelOptions With {.MaxDegreeOfParallelism = maxDegreeOfParallelism},
                    Sub(word)
                        ' We are running on a thread pool thread now.
                        ' Be careful not to access any UI until we hit
                        ' uiFactory.StartNew(...)

                        ' Perform transformation.
                        Dim url = "http://api.mymemory.translated.net/get?q=" & word & "!&langpair=de|it&de=somemail#christmas.com"
                        Dim request = DirectCast(WebRequest.Create(url), HttpWebRequest)

                        ' Note that unless you specify this explicitly,
                        ' the framework will use the default and you
                        ' will be limited to 2 parallel requests
                        ' regardless of how many threads you spawn.
                        request.ServicePoint.ConnectionLimit = maxDegreeOfParallelism

                        Using response = DirectCast(request.GetResponse(), HttpWebResponse)
                            Using myreader As New StreamReader(response.GetResponseStream())
                                Dim rawresp = myreader.ReadToEnd()
                                Debug.WriteLine("Raw:" & rawresp)

                                ' Transform the raw response here.
                                Dim processed = rawresp

                                uiFactory.StartNew(
                                    Sub()
                                        ' This is running on the UI thread,
                                        ' so we can access the controls,
                                        ' i.e. add the processed result
                                        ' to the data grid.
                                        Me.Text = processed
                                    End Sub, TaskCreationOptions.PreferFairness)
                            End Using
                        End Using
                    End Sub)
            End Sub)

        transformTask.ContinueWith(
            Sub(t As Task)
                ' Always stop the stopwatch.
                sw.Stop()

                ' Again, we are back on the UI thread, so we
                ' could access UI controls if we needed to.
                If t.Status = TaskStatus.Faulted Then
                    Debug.Print("The transformation errored: {0}", t.Exception)
                Else
                    Debug.Print("Operation completed in {0} s.", sw.ElapsedMilliseconds / 1000)
                End If
            End Sub,
            uiScheduler)
    End Sub

End Class
If you want to send 10 parallel requests, you must create 10 BackgroundWorkers, or manually create 10 threads. Then iterate, and whenever a worker/thread is done, give it a new task (a throttling sketch follows below).
I do not recommend firing 5000 parallel threads/workers; you must be careful:
A load like that could be interpreted as spamming or an attack by the server. Don't overdo it; maybe talk to translated.net and ask them about the workload they accept.
Also think about what your machine and your internet upstream can handle.
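For the "N at a time" idea above, here is a minimal sketch that swaps the hand-managed pool of workers for a SemaphoreSlim cap. The limit of 10 and the DoTranslate placeholder are assumptions for illustration, and TranslateAll should be called off the UI thread, since it blocks while waiting for free slots.
Imports System.Collections.Generic
Imports System.Threading
Imports System.Threading.Tasks

Module ThrottledRequests
    ' Allow at most 10 requests in flight at once (an assumed limit).
    Private ReadOnly gate As New SemaphoreSlim(10)

    Sub TranslateAll(inputs As IEnumerable(Of String))
        Dim pending As New List(Of Task)
        For Each item In inputs
            gate.Wait() ' block until one of the 10 slots is free
            Dim current As String = item ' capture a copy for the lambda
            pending.Add(Task.Factory.StartNew(
                Sub()
                    Try
                        DoTranslate(current)
                    Finally
                        gate.Release() ' free the slot for the next request
                    End Try
                End Sub))
        Next
        Task.WaitAll(pending.ToArray())
    End Sub

    Sub DoTranslate(text As String)
        ' Placeholder for the HttpWebRequest call shown in the question.
    End Sub
End Module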
I would create a Task for every request, so you can have a Callback for every call using ContinueWith:
For Each InputString As String In myCollectionString
    Dim current As String = InputString ' capture a copy for the lambda
    Tasks.Task(Of String).Factory.StartNew(
        Function()
            Dim request As HttpWebRequest
            Dim myreader As StreamReader
            Dim response As HttpWebResponse
            Dim rawResp As String = String.Empty
            Try
                Dim url As String = "http://api.mymemory.translated.net/get?q=" & current & "!&langpair=de|it&de=somemail#christmas.com"
                request = DirectCast(WebRequest.Create(url), HttpWebRequest)
                response = DirectCast(request.GetResponse(), HttpWebResponse)
                myreader = New StreamReader(response.GetResponseStream())
                rawResp = myreader.ReadToEnd()
                Debug.WriteLine("Raw:" & rawResp)
            Catch ex As Exception
                MessageBox.Show(ex.ToString)
            End Try
            Return rawResp
        End Function).ContinueWith(
            Sub(task As Tasks.Task(Of String))
                ' Do something with the result.
                Console.WriteLine(task.Result)
            End Sub)
Next

VB.net Parsing HTML 100 times. Will it work?

Imports System.IO
Imports System.Net
Imports System.Net.ServicePointManager
Imports System.Web

Public Class GetSource

    Function GetHtml(ByVal strPage As String) As String
tryAgain:
        ServicePointManager.UseNagleAlgorithm = True
        ServicePointManager.Expect100Continue = True
        ServicePointManager.CheckCertificateRevocationList = True
        ServicePointManager.DefaultConnectionLimit = 100
        Dim strReply As String = "NULL"
        Try
            Dim objhttprequest As System.Net.HttpWebRequest
            Dim objhttpresponse As System.Net.HttpWebResponse
            objhttprequest = System.Net.HttpWebRequest.Create(strPage)
            objhttprequest.Proxy = proxyObject
            objhttprequest.AllowAutoRedirect = True
            objhttprequest.Timeout = 100000
            objhttpresponse = objhttprequest.GetResponse
            Dim objstrmreader As New StreamReader(objhttpresponse.GetResponseStream)
            strReply = objstrmreader.ReadToEnd()
        Catch ex2 As System.Net.WebException
            GoTo tryAgain
        Catch ex As Exception
            strReply = "ERROR! " + ex.Message.ToString
            GoTo tryAgain
        End Try
        Return strReply
    End Function

End Class
What I have here is VB.NET code where I parse a website for its HTML.
This function works fine.
The question is this:
1. If I run 100 threads with this function at the same time, will it work?
2. Won't it affect my internet connection as well?
I don't want to waste time creating threads and code a hundred times, so if you know the answer, please advise me on what I should do instead.
One thing I see that could cause you problems is the GoTo. You retry if you get an error, but there is no way to break out of the method if an error occurs every time you request the page, causing an infinite loop. You should put a check in, saying only try again if some cancel flag has not been set (see the sketch after this answer).
Second, there could be issues with the number of threads you run, depending on how much work each thread must do. There is a CPU and memory cost for each thread, and it could peg your machine, especially if you get an infinite loop in one of them.
Everything else gets an "it depends." Your PC and internet connection will determine everything else. There are tools available to monitor this, and I would suggest using them to see what works. I found this page with a lot of information; it might have what you are looking for: http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html. Hope this helps.
Wade
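A minimal sketch of the bounded-retry idea mentioned above, assuming a cap of three attempts; the download itself is reduced to a plain GET with no proxy or ServicePointManager settings.
Imports System.IO
Imports System.Net

Module BoundedRetry
    Function GetHtmlWithRetry(url As String, Optional maxAttempts As Integer = 3) As String
        For attempt As Integer = 1 To maxAttempts
            Try
                Dim request = DirectCast(WebRequest.Create(url), HttpWebRequest)
                Using response = request.GetResponse()
                    Using reader As New StreamReader(response.GetResponseStream())
                        Return reader.ReadToEnd()
                    End Using
                End Using
            Catch ex As WebException
                ' Fall through and try again, but only up to maxAttempts times,
                ' so a permanently failing URL cannot loop forever.
            End Try
        Next
        Return "ERROR! Gave up after " & maxAttempts & " attempts"
    End Function
End Module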

My URL checker function is hanging the application in vb.net

Here is my VB.NET 2008 code:
Public Function CheckURL(ByVal URL As String) As Boolean
    Try
        Dim Response As Net.WebResponse = Nothing
        Dim WebReq As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
        Response = WebReq.GetResponse
        Response.Close()
        Return True
    Catch ex As Exception
    End Try
End Function
When a URL is being checked, it hangs my application for a while. Is it possible to check the whole URL list smoothly without hanging my application?
Is there any other, faster way to check URLs?
Note: I have about 800 URLs in a file, and I need to check whether each link gets a valid response from the website or not.
If an exception occurs, the WebResponse object isn't properly disposed of. This can lead to your app running out of connections. Something like this will work better:
Try
    Dim WebReq As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
    Using Response = WebReq.GetResponse()
        Return True
    End Using
Catch ex As WebException
    Return False
End Try
Using the Using keyword ensures that the response is closed and finalized whenever that block exits.
If it's the server itself that's taking a while to respond, look into the BeginGetResponse method on the HttpWebRequest; check MSDN for a sample of how to use it. But be warned, that way also lies madness if you are not careful.
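For reference, a minimal sketch of that asynchronous pattern, assuming you only need to know whether the URL answered; the callback runs on a thread-pool thread, so don't touch UI controls inside it.
Imports System.Net

Module AsyncCheck
    Sub CheckUrlAsync(url As String)
        Dim request = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.BeginGetResponse(
            Sub(ar As IAsyncResult)
                ' This callback runs once the server answers (or the request
                ' fails), so the calling thread is never blocked while waiting.
                Try
                    Using response = request.EndGetResponse(ar)
                        Console.WriteLine("{0} responded", url)
                    End Using
                Catch ex As WebException
                    Console.WriteLine("{0} failed: {1}", url, ex.Status)
                End Try
            End Sub, Nothing)
    End Sub
End Module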
The answer is twofold:
Most of the waiting time is due to downloading content you don't need. If you ask the server to return only the headers, you will receive substantially less data, which will make your process faster.
As Matt identified, you aren't disposing of your connections, which may slow your process down.
Expanding on Matt's answer, do the following:
Try
    Dim WebReq As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
    WebReq.Method = "HEAD" 'This is the important line.
    Using Response = WebReq.GetResponse()
        Return True
    End Using
Catch ex As WebException
    Return False
End Try
GetResponse delivers the whole content of your request. If that is what you want, there isn't much room to speed up the request on the client side, since it mostly depends on how fast the URL's server replies and how much data is sent over. If you just want to check whether the URL is valid (or responding at all), it might be better to just ping it, as in the sketch below.
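A minimal sketch of the ping idea, assuming you only care whether the host is reachable at all; note that a successful ping does not prove the web server itself is up, and some hosts block ICMP entirely.
Imports System.Net.NetworkInformation

Module PingCheck
    Function HostResponds(url As String) As Boolean
        Try
            Dim host = New Uri(url).Host
            Using pinger As New Ping()
                Dim reply = pinger.Send(host, 3000) ' 3-second timeout
                Return reply.Status = IPStatus.Success
            End Using
        Catch ex As Exception
            Return False
        End Try
    End Function
End Module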
Keep in mind that the response from GetResponse isn't disposed when it runs into an error, so use the code posted by Matt to avoid this.
For your other problem, the hanging application, you can avoid it by running this code on a separate thread.
This works basically like this (from here):
rem at the top of the code
Imports System.Threading
...
rem your event handler, e.g. button click or whatever
Dim trd As New Thread(AddressOf ThreadTask)
trd.IsBackground = True
trd.Start()
rem your code
Private Sub ThreadTask()
    Dim i As Long
    Do
        i += 1
        Thread.Sleep(100)
    Loop
End Sub

VB.net httpwebrequest For loop hangs every 10 or so iterations

I am trying to loop through an array and perform an httpwebrequest in each iteration.
The code seems to work; however, it pauses for a while (e.g. a lot longer than the set timeout; I tried setting that to 100 to check, and it still pauses) after every 10 or so iterations, then carries on working.
Here is what I have so far:
For i As Integer = 0 To numberOfProxies - 1
    Try
        'create request to a proxyJudge php page using proxy
        Dim request As HttpWebRequest = HttpWebRequest.Create("http://www.pr0.net/deny/azenv.php")
        request.Proxy = New Net.WebProxy(proxies(i)) 'select the current proxy from the proxies array
        request.Timeout = 10000 'set timeout
        Dim response As HttpWebResponse = request.GetResponse()
        Dim sr As StreamReader = New StreamReader(response.GetResponseStream())
        Dim pageSourceCode As String = sr.ReadToEnd()
        'check the downloaded source for certain phrases, each identifies a type of proxy
        'HTTP_X_FORWARDED_FOR identifies a transparent proxy
        If pageSourceCode.Contains("HTTP_X_FORWARDED_FOR") Then
            'delegate method for cross thread safe
            UpdateListbox(ListBox3, proxies(i))
        ElseIf pageSourceCode.Contains("HTTP_VIA") Then
            UpdateListbox(ListBox2, proxies(i))
        Else
            UpdateListbox(ListBox1, proxies(i))
        End If
    Catch ex As Exception
        'MessageBox.Show(ex.ToString) used in testing
        UpdateListbox(ListBox4, proxies(i))
    End Try
    completedProxyCheck += 1
    lblTotalProxiesChecked.CustomInvoke(Sub(l) l.Text = completedProxyCheck)
Next
I have searched all over this site and via Google, and most responses to this type of question say the response must be closed. I have tried a Using block, e.g.:
Using response As HttpWebResponse = request.GetResponse()
    Using sr As StreamReader = New StreamReader(response.GetResponseStream())
        Dim pageSourceCode As String = sr.ReadToEnd()
        'check the downloaded source for certain phrases, each identifies a type of proxy
        'HTTP_X_FORWARDED_FOR identifies a transparent proxy
        If pageSourceCode.Contains("HTTP_X_FORWARDED_FOR") Then
            'delegate method for cross thread safe
            UpdateListbox(ListBox3, proxies(i))
        ElseIf pageSourceCode.Contains("HTTP_VIA") Then
            UpdateListbox(ListBox2, proxies(i))
        Else
            UpdateListbox(ListBox1, proxies(i))
        End If
    End Using
End Using
And it makes no difference (though I may have implemented it wrong). As you can tell, I'm very new to VB and OOP in general, so it's probably a simple problem, but I can't work it out.
Any suggestions, or just tips on how to diagnose these types of problems, would be really appreciated.
EDIT:
Now I'm thoroughly confused. Does the Try/Catch statement automatically close the response, or do I need to put something in Finally? If so, what? I can't use response.Close() because it's declared in the Try block.
Perhaps I'm just using badly structured code and there is a much better way to do this? Or is something else causing the pause/hang?
Yeah, you need to close the response after you are done with it, as .NET enforces a maximum number of concurrent requests, so just add
response.Close()
at the end of your code block.
Because it's very difficult to write code in a comment, I will continue as an answer.
For i As Integer = 0 To numberOfProxies - 1
    Dim response As HttpWebResponse = Nothing
    Try
        'create request to a proxyJudge php page using proxy
        Dim request As HttpWebRequest = HttpWebRequest.Create("http://www.pr0.net/deny/azenv.php")
        request.Proxy = New Net.WebProxy(proxies(i)) 'select the current proxy from the proxies array
        request.Timeout = 10000 'set timeout
        response = request.GetResponse()
        Dim sr As StreamReader = New StreamReader(response.GetResponseStream())
        Dim pageSourceCode As String = sr.ReadToEnd()
        'check the downloaded source for certain phrases, each identifies a type of proxy
        'HTTP_X_FORWARDED_FOR identifies a transparent proxy
        If pageSourceCode.Contains("HTTP_X_FORWARDED_FOR") Then
            'delegate method for cross thread safe
            UpdateListbox(ListBox3, proxies(i))
        ElseIf pageSourceCode.Contains("HTTP_VIA") Then
            UpdateListbox(ListBox2, proxies(i))
        Else
            UpdateListbox(ListBox1, proxies(i))
        End If
    Catch ex As Exception
        'MessageBox.Show(ex.ToString) used in testing
        UpdateListbox(ListBox4, proxies(i))
    Finally
        'Close the response even when an exception was thrown,
        'but only if GetResponse actually returned one.
        If response IsNot Nothing Then response.Close()
    End Try
    completedProxyCheck += 1
    lblTotalProxiesChecked.CustomInvoke(Sub(l) l.Text = completedProxyCheck)
Next