Web-scraping across multiple pages without even knowing the last page number - vba

Running my code to crawl the titles of different tutorials spread across several pages of a site, I found it working flawlessly. Rather than depending on the last page number in the URL, I wrote the code to keep requesting pages until the status code is no longer 200 (http.Status <> 200). The code pasted below works impeccably in this case. However, trouble comes up when I try it against another URL to see whether it breaks automatically: the code fetches all the results but never breaks out of the loop. What is the workaround so that the code breaks when it is done and stops the macro? Here is the working version:
Sub WiseOwl()
    Const mlink = "http://www.wiseowl.co.uk/videos/default"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, x As Long, y As Long

    Do While True
        y = y + 1
        With http
            .Open "GET", mlink & "-" & y & ".htm", False
            .send
            If .Status <> 200 Then 'no such page: we are past the last one
                MsgBox "It's done"
                Exit Sub
            End If
            html.body.innerHTML = .responseText
        End With
        For Each post In html.getElementsByClassName("woVideoListDefaultSeriesTitle")
            With post.getElementsByTagName("a")
                x = x + 1
                If .Length Then Cells(x, 1) = .item(0).innerText
            End With
        Next post
    Loop
End Sub
I found a logic to get around it with Yellow Pages: my updated script is able to parse Yellow Pages and breaks out at the last page, because there is no "Next Page" button there. I tried it with this:
"https://www.dropbox.com/s/iptqm79b0byw3dz/Yellowpage.txt?dl=0"
However, when I apply the same logic to a torrent site, it doesn't work:
"https://www.yify-torrent.org/genres/western/p-1/"

You can always rely on whether elements exist or not. Here, for example, if you try to use an object whose element was never set, you will get:
Run-time error '91': Object variable or With block variable not set
This is the key you should look for to put an end to your code. Please see the example below:
Sub yify()
    Const mlink = "https://www.yify-torrent.org/genres/western/p-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim x As Long, y As Long

    y = 1
    Do
        With http
            .Open "GET", mlink & y & "/", False
            .send
            html.body.innerHTML = .responseText
        End With
        Set posts = html.getElementsByClassName("mv")
        On Error GoTo Endofpage
        Debug.Print Len(posts) 'forces Error 91 when posts is Nothing
        For Each post In posts
            With post.getElementsByTagName("div")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
        y = y + 1
Endofpage:
    Loop Until Err.Number = 91
    Debug.Print "It's over"
End Sub
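If you'd rather not raise an error deliberately, here is a minimal alternative sketch (the Sub name yifyNoError is just for illustration; same URL and selector as above). It assumes the collection comes back empty, or not at all, once you are past the last page:
Sub yifyNoError()
    Const mlink = "https://www.yify-torrent.org/genres/western/p-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim x As Long, y As Long

    y = 1
    Do
        With http
            .Open "GET", mlink & y & "/", False
            .send
            html.body.innerHTML = .responseText
        End With
        Set posts = html.getElementsByClassName("mv")
        If posts Is Nothing Then Exit Do 'no collection at all: stop
        If posts.Length = 0 Then Exit Do 'no ".mv" entries left: stop
        For Each post In posts
            With post.getElementsByTagName("div")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
        y = y + 1
    Loop
    Debug.Print "It's over"
End Sub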


Can't get rid of "old format or invalid type library" error in vba

I've written a script in VBA to get some set names from a webpage, and the script was getting them accordingly until it hit an error somewhere during execution. This is the first time I've encountered such an error.
What my script does is grab all the links under Company Sets; then, tracking down each of those links, it goes one layer deep, and, following all the links under Set Name, it goes another layer deep and finally parses the table from there. I parsed the name of the PUBLISHED SET, which is stored in the variable bName, instead of the table, as the script was getting bigger. I used IE to get the PUBLISHED SET because a few leads were causing encoding issues.
I searched all over for a workaround but had no luck.
However, I came across this thread with a proposed solution written in VB, but I can't figure out how to make it work within VBA.
Script I'm trying with:
Sub FetchRecords()
    Const baseUrl$ = "https://www.psacard.com"
    Const link = "https://www.psacard.com/psasetregistry/baseball/company-sets/16"
    Dim IE As New InternetExplorer, Htmldoc As HTMLDocument
    Dim Http As New XMLHTTP60, Html As New HTMLDocument, bName$, tRow As Object
    Dim post As Object, elem As Object, posts As Object, I&, R&, C&
    Dim key As Variant
    Dim idic As Object: Set idic = CreateObject("Scripting.Dictionary")

    With Http
        .Open "GET", link, False
        .send
        Html.body.innerHTML = .responseText
    End With

    Set posts = Html.querySelectorAll(".dataTable tr td a[href*='/psasetregistry/baseball/company-sets/']")
    For I = 0 To posts.Length - 7
        idic(baseUrl & Split(posts(I).getAttribute("href"), "about:")(1)) = 1
    Next I

    For Each key In idic.Keys
        With Http
            .Open "GET", key, False
            .send
            Html.body.innerHTML = .responseText
        End With
        For Each post In Html.getElementsByTagName("a")
            If InStr(post.getAttribute("title"), "Contact User") > 0 Then
                If InStr(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "publishedset") > 0 Then
                    IE.Visible = True
                    IE.navigate baseUrl & Split(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "about:")(1)
                    While IE.Busy = True Or IE.readyState < 4: DoEvents: Wend
                    Set Htmldoc = IE.document
                    bName = Htmldoc.querySelector("h1 b.text-primary").innerText
                    If InStr(bName, "/") > 0 Then bName = Split(Htmldoc.querySelector(".inline-block a[href*='/contactuser/']").innerText, " ")(1)
                    R = R + 1: Cells(R, 1) = bName
                End If
            End If
        Next post
    Next key
    IE.Quit
End Sub
I get that error pointing at the following line, after extracting somewhere between 70 and 90 records:
bName = Htmldoc.querySelector("h1 b.text-primary").innerText
The error looks like:
Automation Error: old format or invalid type library
Proposed solution in the linked thread, written in VB.NET (I can't convert it to VBA):
'save the current settings for easier restoration later
Dim oldCI As System.Globalization.CultureInfo = _
    System.Threading.Thread.CurrentThread.CurrentCulture
'change the settings
System.Threading.Thread.CurrentThread.CurrentCulture = _
    New System.Globalization.CultureInfo("en-US")
'Your code here
'restore the previous settings
System.Threading.Thread.CurrentThread.CurrentCulture = oldCI
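VBA has no System.Threading thread culture, but the same idea can be approximated through the Windows API. The following is a hedged sketch, not a verified fix for this exact page: SetThreadLocale and GetThreadLocale are kernel32 functions, and &H409 is the en-US locale ID. Switch the locale around the failing call and restore it afterwards:
#If VBA7 Then
    Private Declare PtrSafe Function GetThreadLocale Lib "kernel32" () As Long
    Private Declare PtrSafe Function SetThreadLocale Lib "kernel32" (ByVal Locale As Long) As Long
#Else
    Private Declare Function GetThreadLocale Lib "kernel32" () As Long
    Private Declare Function SetThreadLocale Lib "kernel32" (ByVal Locale As Long) As Long
#End If

Sub RunWithUsLocale()
    Dim oldLocale As Long
    oldLocale = GetThreadLocale()   'save the current settings for easier restoration later
    SetThreadLocale &H409           'change the settings to en-US
    'your code here, e.g. the bName = Htmldoc.querySelector(...) call
    SetThreadLocale oldLocale       'restore the previous settings
End Sub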

EXCEL - Open all links in a new tab

I have an Excel sheet which contains many links.
How do I open them all at once, each in a new tab of my default browser?
Like this? It includes a basic check that each URL is valid. The advantage here is that you can adapt it to log information about the response from each URL.
Option Explicit

Sub TEST()
    Dim h As Hyperlink
    For Each h In ActiveSheet.Hyperlinks
        If UrlOK(h.Address) Then h.Follow
    Next h
End Sub

Public Function UrlOK(ByVal url As String) As Boolean
    Dim request As Object
    Dim respCode As Long
    On Error Resume Next
    Set request = CreateObject("WinHttp.WinHttpRequest.5.1")
    With request
        .Open "GET", url, False
        .Send
        respCode = .Status
    End With
    If respCode = 200 Then UrlOK = True
    On Error GoTo 0
End Function
Edit: Thanks to @Omegastripes for noting:
1) If you use MSXML2.XMLHTTP rather than WinHttp.WinHttpRequest.5.1, you get a more reliable result.
Benefits include (amongst others):
A) Simplified code to open a URL.
B) Separate sessions do not impact each other.
C) Protected Mode IE support.
D) Credential cache.
2) Use HEAD rather than GET in the request to reduce network traffic.
With a HEAD request, a server returns only the headers of a resource rather than the resource itself.
So you could use a revised, more efficient function, as follows:
Public Function UrlOK(ByVal url As String) As Boolean
    Dim request As Object
    Dim respCode As Long
    On Error Resume Next
    Set request = CreateObject("MSXML2.XMLHTTP")
    With request
        .Open "HEAD", url, False
        .Send
        respCode = .Status
    End With
    If respCode = 200 Then UrlOK = True
    On Error GoTo 0
End Function
(Image omitted: the code in a standard module, with the cursor placed inside the TEST sub to run it.)
That's pretty easy in VBA:
Sub OpenAll()
    Dim H As Hyperlink
    For Each H In ActiveWorkbook.ActiveSheet.UsedRange.Hyperlinks
        H.Follow
    Next
End Sub
If there are invalid URLs, you can stop the code from erroring like this:
Sub OpenAll()
    Dim H As Hyperlink
    For Each H In ActiveWorkbook.ActiveSheet.Hyperlinks
        On Error Resume Next
        H.Follow
        On Error GoTo 0
    Next
End Sub

How to handle errors efficiently to prevent misleading results?

I've written some code in VBA to find certain identities against some names on some websites. The code works well if everything is in the right order: the link is valid, the name matches an anchor tag, and the regex finds the identity. If any (or all) of the three are bad searches, the script throws an error. I've marked the positions where the errors occur in my script below.
All I expect from you experts is a solution for handling the errors so my script continues until all the links are exhausted.
As I don't have much knowledge of VBA, I tried On Error Resume Next to skip the errors. However, it turns into a clear mess when I look at the results. I'm pasting a rough example of what I get when I use On Error Resume Next.
Sub Identity_Finder()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, link As Variant, refined_links As String
    Dim rxp As New RegExp, identity As Object
    Dim r As Long

    For Each link In [{"http://spltech.in/","http://www.unifrostindia.com/","http://www.unitfrostindia.com/","http://www.greenplanet.in/"}]
        With http
            .Open "GET", link, False
            .send '''throws the first error here if the link is invalid
            html.body.innerHTML = .responseText
        End With
        For Each post In html.getElementsByTagName("a")
            If InStr(post.innerText, "certain_name") > 0 Then refined_links = post.href: Exit For
        Next post
        With http
            .Open "GET", refined_links, False
            .send ''throws another error here if no such link is found
        End With
        With rxp
            .Pattern = "some_regex"
            .Global = True
            Set identity = .Execute(http.responseText)
        End With
        r = r + 1: Cells(r, 1) = link
        Cells(r, 2) = identity(0) ''''throws another error here if no such identity is found
    Next link
End Sub
What I get upon using On Error Resume Next:
John executive
Mac lawyer
lulu lawyer
Robin lawyer
Cathy student
Expected output:
John executive
Mac lawyer
lulu
Robin
Cathy student
The empty fields (when identities are not found) are filled in with the previous values when I use On Error Resume Next. How can I get around this misleading result? Thanks in advance.
The most efficient way to error-trap in VBA is to:
1) actually test the inputs/results before running them, either through custom-made functions, built-in coding constructs, or a mix of both;
2) use VBA's built-in error handling only if absolutely needed.
Example 1
For example, you can wrap this statement in a custom function that tests whether a URL is valid:
With http
    .Open "GET", link, False
    .send '''throws the first error here if the link is invalid
    html.body.innerHTML = .responseText
End With
becomes:
If ValidURL(link) Then
    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With
End If
Where ValidURL is a function defined as:
Function ValidURL(URL As String) As Boolean
    Dim result As Boolean
    'I don't know how you would specify a valid link in your specific case,
    'but that code goes here; a dummy example follows
    result = Left(URL, 7) = "http://"
    ValidURL = result 'True or False
End Function
Example 2
I assume that in this statement:
With http
    .Open "GET", refined_links, False
    .send ''throws another error here if no such link is found
End With
there is a specific error number (code) produced when no such link is found. Discover that number and use the following code to bypass it.
Dim errNum As Long
With http
    .Open "GET", refined_links, False
    On Error Resume Next
    .Send
    errNum = Err.Number 'capture now: On Error GoTo 0 resets the Err object
    On Error GoTo 0
End With
If errNum <> 9999 Then 'replace 9999 with the correct number
    'continue with regex test
End If
PUTTING IT ALL TOGETHER
Finally, putting it all together, you can build something like the following, with minimal use of On Error Resume Next and no GoTo statements.
Dim errNum As Long
For Each link In [{"http://spltech.in/","http://www.unifrostindia.com/","http://www.unitfrostindia.com/","http://www.greenplanet.in/"}]
    If ValidURL(link) Then
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With
        For Each post In html.getElementsByTagName("a")
            If InStr(post.innerText, "certain_name") > 0 Then refined_links = post.href: Exit For
        Next post
        With http
            .Open "GET", refined_links, False
            On Error Resume Next
            .Send
            errNum = Err.Number 'capture before On Error GoTo 0 resets the Err object
            On Error GoTo 0
        End With
        If errNum <> 9999 Then 'replace 9999 with the correct number
            With rxp
                .Pattern = "some_regex"
                .Global = True
                Set identity = .Execute(http.responseText)
            End With
            'I will leave it to you to account for no pattern match
            r = r + 1: Cells(r, 1) = link
            Cells(r, 2) = identity(0) 'throws an error here if no identity is found
        End If
    End If
Next link
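One guard worth adding, since it is the direct cause of the misleading output in the question: refined_links and identity keep their values from the previous iteration whenever a step fails. A small sketch of the reset, reusing the variable names from the code above:
For Each link In [{"http://spltech.in/","http://www.unifrostindia.com/","http://www.unitfrostindia.com/","http://www.greenplanet.in/"}]
    refined_links = vbNullString 'reset so a failed search can't reuse the previous link
    Set identity = Nothing       'reset so a failed match can't reuse the previous identity
    '... requests, link search and regex as above ...
    r = r + 1: Cells(r, 1) = link
    If Not identity Is Nothing Then
        If identity.Count > 0 Then Cells(r, 2) = identity(0)
    End If
Next link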

Half of the records are getting scraped out of 84

I've made a parser in VBA which is able to scrape names from Yellow Pages Canada. However, the issue is that the page contains 84 names but my parser scrapes only 41 of them. How can I fix this? Any help would be a blessing. Thanks in advance. Here is the code:
http.Open "GET", "http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("listing__name--link jsListingName")
For Each topic In topics
    Cells(x, 1) = topic.innerText
    x = x + 1
Next topic
By the way, I used an MSXML2.XMLHTTP60 request.
If you look at the page's web requests, you'll notice it'll trigger another web request once the page has been scrolled past a certain point.
The format of the new requests is like this:
First 40 records: http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton
Next 40 records: http://www.yellowpages.ca/search/si/2/Outdoor%20wedding/Edmonton
Next 40 records: http://www.yellowpages.ca/search/si/3/Outdoor%20wedding/Edmonton
Basically for new data (in batches of 40 records) it increments part of the URL by 1.
Which is good news: we can just loop to return all the results. Here's the code I came up with. For whatever reason, the getElementsByClassName selector wasn't working for me, so I worked around it in my code. If that selector works for you, use it instead of what I have below for that part.
Lastly, I added an explicit reference to Microsoft XML, v6.0 (Tools > References in the VB editor), so you should do the same to get this to function as-is.
Option Explicit

Public Sub SOTestScraper()
    Dim topics As Object
    Dim topic As Object
    Dim webResp As Object
    Dim i As Long
    Dim j As Long
    Dim mySheet As Worksheet: Set mySheet = ThisWorkbook.Sheets("Sheet1") ' Change this
    Dim myArr() As Variant: ReDim myArr(10000) 'Probably overkill

    For i = 1 To 20 ' unsure how many records you expect; I defaulted to 20 pages, or 800 results
        Set webResp = getWebResponse(CStr(i)) ' return the web response
        If webResp Is Nothing Then Exit For ' exit the loop if Status 200 wasn't received
        Set topics = webResp.getElementsByTagName("*") ' I couldn't get the class name to work, so I did this instead
        For Each topic In topics
            On Error Resume Next
            'If getElementsByClassName works for you, use it instead
            If topic.ClassName = "listing__name--link jsListingName" Then
                myArr(j) = topic.innerText
                j = j + 1
            End If
        Next
    Next

    'add the data to Excel
    If j = 0 Then Exit Sub 'nothing was scraped
    ReDim Preserve myArr(j - 1)
    mySheet.Range("A1:A" & j) = WorksheetFunction.Transpose(myArr)
End Sub
Function getWebResponse(ByVal pageNumber As String) As Object
    Dim http As MSXML2.ServerXMLHTTP60: Set http = New MSXML2.ServerXMLHTTP60
    Dim html As Object: Set html = CreateObject("htmlfile")

    With http
        .Open "GET", "http://www.yellowpages.ca/search/si/" & pageNumber & "/Outdoor%20wedding/Edmonton"
        .send
        .waitForResponse
        html.body.innerHTML = .responseText
    End With

    If Not http.Status = 200 Then
        Set getWebResponse = Nothing
    Else
        Set getWebResponse = html
    End If

    Set html = Nothing
    Set http = Nothing
End Function
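As an aside, a likely reason getElementsByClassName failed above is that CreateObject("htmlfile") produces a document in a legacy compatibility mode that lacks the newer selector methods. Below is a hedged variant of the function (getWebResponse2 is just an illustrative name; it assumes a reference to Microsoft HTML Object Library) that parses into an MSHTML.HTMLDocument instead, where the class-name selector usually works:
Function getWebResponse2(ByVal pageNumber As String) As Object
    Dim http As MSXML2.ServerXMLHTTP60: Set http = New MSXML2.ServerXMLHTTP60
    Dim html As MSHTML.HTMLDocument: Set html = New MSHTML.HTMLDocument 'needs Microsoft HTML Object Library

    With http
        .Open "GET", "http://www.yellowpages.ca/search/si/" & pageNumber & "/Outdoor%20wedding/Edmonton", False
        .send
        html.body.innerHTML = .responseText
    End With

    If http.Status = 200 Then Set getWebResponse2 = html 'returns Nothing otherwise
End Function
With this variant, Set topics = webResp.getElementsByClassName("listing__name--link jsListingName") should work directly in SOTestScraper.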

Unable to go deep for certain barriers in a multilayered webpage to fetch data

How do I reach the last layer of a webpage starting from the first page? I tried, but got stuck: every time I ran my code to go a layer deeper, it crawled the same page again and again. Finally, I made it work. Here is the full code.
Sub bjscrawler()
    Const url = "http://www.bjs.com"
    Dim html As New HTMLDocument, htm As New HTMLDocument
    Dim topics As Object, post As Object, topic As Object, newlinks As String
    Dim links As Object, link As Object, data As Object
    Dim x As Long

    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", url, False
        .setRequestHeader "Content-Type", "text/xml"
        .send
        html.body.innerHTML = .responseText
    End With

    Set topics = html.getElementsByClassName("text")
    For Each post In topics
        Set topic = post.getElementsByTagName("a")(0)
        newlinks = url & Split(topic.href, ":")(1)
        With CreateObject("MSXML2.serverXMLHTTP")
            .Open "GET", newlinks, False
            .send
            htm.body.innerHTML = .responseText
        End With
        Set links = htm.getElementsByClassName("rightView")
        For Each link In links
            Set data = link.getElementsByTagName("h1")(0)
            x = x + 1
            Cells(x, 1) = data.innerText
        Next link
    Next post
End Sub
In the code:
For m = 0 To mla.Length - 1
    z = mla(m).getAttribute("href")
    link = pageurl & Mid(z, InStr(z, ":") + 1)
Next m
link will only contain the last URL of mla; all the other ones are "gone".
Also check the URL you create in link: it can be invalid. As a result, the next GET will fail, but the code doesn't check for that and just "carries on". http.responseText will, for example, be a 404 page-not-found, the call hmm.getElementsByClassName will return an empty set, and For Each fla will be an empty loop.
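A sketch of that fix, assuming mla, z, and pageurl as in the snippet above: store every URL in a Collection instead of overwriting a single variable, then process the Collection afterwards.
Dim allLinks As New Collection
For m = 0 To mla.Length - 1
    z = mla(m).getAttribute("href")
    allLinks.Add pageurl & Mid(z, InStr(z, ":") + 1) 'keep every link, not just the last one
Next m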
In the code:
If cc <> "" Then
refinedlinks = cc
End If
validlinks = refinedlinks
Cells(x, 1) = validlinks
x = x + 1
you fill the cell also when cc was empty, which generates duplicates. Change to:
If cc <> "" Then
Cells(x, 1) = cc
x = x + 1
End If
When you say
''' I'm stuck at this point. Not i can pull links from here nor can go
'''deeper. Because object elements are not same for all the links.
you probably want to process all the cells you just filled, not only the last validlinks. So iterate over the cells:
lastx = x
For x = 1 To lastx
    http.Open "GET", Cells(x, 1), False
    http.send
    'process http.responseText for each stored link here
Next x
I am not sure what you mean by "Because object elements are not same for all the links". I hope these suggestions help you.