VBA web page scroll

I am having a problem scrolling the document to the proper position, and also a problem capturing the right details into Excel. Here is my code; please suggest where I am going wrong.
I tried the following code and am still getting an error:
Public Sub GData()
    'On Error Resume Next
    Dim html As HTMLDocument
    Dim Re, Cr, cipherDict As Object
    Dim sResponse, cipherKey, Str, SG As String
    Dim myArr, RsltArr(14) As Variant
    Set Re = CreateObject("vbscript.regexp")
    Set Cr = CreateObject("MSXML2.XMLHTTP")
    Set cipherDict = CreateObject("Scripting.Dictionary")
    Set html = New HTMLDocument
    URL = "https://www.google.com/maps/place/Silky+Beauty+Salon/#22.2932632,70.7723656,17z/data=!3m1!4b1!4m5!3m4!1s0x3959ca1278f4820b:0x44e998d30e14a58c!8m2!3d22.2932632!4d70.7745543"
    With Cr
        .Open "GET", URL, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        s = .responseText
    End With
    With html
        .body.innerHTML = sResponse
        title = .querySelector("section-hero-header-title-title").innerText
        phone = .querySelector("[data-item-id^=phone] [jsan*=text]").innerText
        webSite = .querySelector("[aria-label^=Website] [jsan*=text]").innerText
    End With
    datarw = ActiveSheet.Cells(ActiveSheet.Rows.Count, "A").End(xlUp).Row + 1
    ActiveSheet.Cells(datarw, 1).Value = title
    ActiveSheet.Cells(datarw, 5).Value = phone
    ActiveSheet.Cells(datarw, 7).Value = webSite
    ActiveSheet.Cells(datarw, 1).Select
    ActiveSheet.Rows(datarw).WrapText = False
End Sub

Looks like you can use combinations of attribute selectors with different operators (^= starts with and *= contains) to search for substrings in attribute values on the page to get your target nodes, using descendant combinators to specify the relationship between the attributes being used for anchoring.
Test whether a matched node Is Not Nothing before attempting to access either an attribute value or its .innerText:
Dim phone As Object, webSite As Object, title As Object

Set title = ie.document.querySelector(".section-hero-header-title-title")
Set phone = ie.document.querySelector("[data-item-id^=phone] [jsan*=text]")
Set webSite = ie.document.querySelector("[aria-label^=Website] [jsan*=text]")

If Not phone Is Nothing Then
    'clean phone.innerText as appropriate
End If
If Not webSite Is Nothing Then
    'clean webSite.innerText as appropriate
End If
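For completeness, the same guard can be applied to title before writing the values back to the worksheet; a minimal sketch, re-using the column layout and the datarw pattern from the question's own code:

Dim datarw As Long
datarw = ActiveSheet.Cells(ActiveSheet.Rows.Count, "A").End(xlUp).Row + 1

'only write values for nodes that were actually matched
If Not title Is Nothing Then ActiveSheet.Cells(datarw, 1).Value = title.innerText
If Not phone Is Nothing Then ActiveSheet.Cells(datarw, 5).Value = phone.innerText
If Not webSite Is Nothing Then ActiveSheet.Cells(datarw, 7).Value = webSite.innerText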
To get the appropriate protocol for the website address, if it is missing, you can feed the cleaned website address into a regex and pull the protocol from earlier in the html, where it sits inside a script tag.
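A rough sketch of that idea with VBScript.RegExp, assuming sResponse still holds the page html and cleanedSite holds the cleaned address (cleanedSite and protoRe are placeholder names, not from the original code):

Dim protoRe As Object, matches As Object, protocol As String
Set protoRe = CreateObject("VBScript.RegExp")
'look for http:// or https:// immediately followed by the cleaned address
protoRe.Pattern = "(https?)://" & Replace(cleanedSite, ".", "\.")
protoRe.IgnoreCase = True
Set matches = protoRe.Execute(sResponse)
If matches.Count > 0 Then
    protocol = matches(0).SubMatches(0) & "://"
Else
    protocol = "http://" 'fall back if no match is found
End If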
Read about
css selectors: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
querySelector: querySelector and querySelectorAll vs getElementsByClassName and getElementById in JavaScript

Related

Extract Data from URL VBA

I am trying to get address data from a URL but I am facing an error. I am just a beginner in VBA and don't understand where the problem in my code is; I hope somebody can help me find the right solution.
I have attached an image and also my VBA code.
Here is my code:
Public Sub IE_GetLink()
    Dim sResponse As String, HTML As HTMLDocument
    Dim url As String
    Dim Re As Object
    Set HTML = New HTMLDocument
    Set Re = CreateObject("MSXML2.XMLHTTP")
    'On Error Resume Next
    url = "http://markexpress.co.in/network1.aspx?Center=360370&Tmp=1656224682265"
    With Re
        .Open "GET", url, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    Dim Title As Object
    With HTML
        .body.innerHTML = sResponse
        Title = .querySelectorAll("#colspan")(0).innerText
    End With
    MsgBox Title
End Sub
Please help me ...
Several things.
What is wrong with your code:
Title should be a String, as you are attempting to assign the return of .innerText to it. You have declared it as an Object, which would require the Set keyword (and the removal of the .innerText accessor).
colspan is an attribute, not an id, so your CSS selector list is incorrect.
Furthermore, looking at what the page actually does, there is a request for an additional document which holds the info you need. You need to take the centre ID you already have and change the URI you make the request to.
Then, you want only the first td in the target table. Change your CSS selector list to target that.
Public Sub GetInfo()
    Dim HTML As MSHTML.HTMLDocument
    Dim re As Object
    Set HTML = New MSHTML.HTMLDocument
    Set re = CreateObject("MSXML2.XMLHTTP")
    Dim url As String
    Dim response As String
    url = "http://crm.markerp.in/NetworkDetail.aspx?Center=360370&Tmp="
    With re
        .Open "GET", url, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        response = .responseText
    End With
    Dim info As String
    With HTML
        .body.innerHTML = response
        info = .querySelector("#tblDisp td").innerText
    End With
    MsgBox info
End Sub
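Since the point is to re-use the centre ID you already have, the request URL can also be built from a variable instead of being hard-coded; a small variation of the above (centerId is an illustrative name, not part of the original):

Dim centerId As String
centerId = "360370" 'the centre ID taken from your original URL
url = "http://crm.markerp.in/NetworkDetail.aspx?Center=" & centerId & "&Tmp="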

Unable to use querySelector within querySelectorAll container in the right way

I'm trying to figure out how I can use .querySelector() on .querySelectorAll().
For example, I get the expected results when I try it like this:
Sub GetContent()
    Const URL$ = "https://stackoverflow.com/questions/tagged/web-scraping?tab=Newest"
    Dim HTMLDoc As New HTMLDocument
    Dim HTML As New HTMLDocument, R&, I&
    With New XMLHTTP60
        .Open "Get", URL, False
        .send
        HTMLDoc.body.innerHTML = .responseText
    End With
    With HTMLDoc.querySelectorAll(".summary")
        For I = 0 To .Length - 1
            HTML.body.innerHTML = .Item(I).outerHTML
            R = R + 1: Cells(R, 1).Value = HTML.querySelector(".question-hyperlink").innerText
        Next I
    End With
End Sub
The script no longer works when I pick another site and try to grab the values under the Rank column of the table, even though I use the same logic:
Sub GetContent()
    Const URL$ = "https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/men/senior/2020?page=1"
    Dim HTMLDoc As New HTMLDocument
    Dim HTML As New HTMLDocument, R&, I&
    With New XMLHTTP60
        .Open "Get", URL, False
        .send
        HTMLDoc.body.innerHTML = .responseText
    End With
    With HTMLDoc.querySelectorAll("#toplists tbody tr")
        For I = 0 To .Length - 1
            HTML.body.innerHTML = .Item(I).outerHTML
            R = R + 1: Cells(R, 1).Value = HTML.querySelector("td").innerText
        Next I
    End With
End Sub
The line I'm talking about in both scripts is Cells(R, 1).Value = HTML.querySelector().innerText; I'm using it the same way within the .querySelectorAll() container.
If I use .querySelector() on .getElementsByTagName(), it works. I also had success using TagName on TagName or ClassName on ClassName, etc., so I can grab the content in a few different ways.
How can I use .querySelector() on .querySelectorAll() in the second script in order for it to work?
Wrap it in table tags so the html parser knows what to do with it.
HTML.body.innerHTML = "<table>" & .Item(I).outerHTML & "</table>"
Doing so preserves the structure of the opening td tag, which is otherwise stripped out by the parser.
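Applied to the second script, the inner loop would then look something like this (only the wrapping line changes; everything else is as in the question):

With HTMLDoc.querySelectorAll("#toplists tbody tr")
    For I = 0 To .Length - 1
        'wrap the row so the parser keeps the td structure intact
        HTML.body.innerHTML = "<table>" & .Item(I).outerHTML & "</table>"
        R = R + 1: Cells(R, 1).Value = HTML.querySelector("td").innerText
    Next I
End With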

Can't get rid of "old format or invalid type library" error in vba

I've written a script in VBA to get some set names from a webpage, and the script fetches them accordingly until it hits an error somewhere during execution. This is the first time I've encountered such an error.
What my script does is get all the links under Company Sets, then, following each of those links, it goes one layer deep; following all the links under Set Name it goes another layer deep and finally parses the table there. I parsed the name of PUBLISHED SET, which is stored in the variable bName, instead of the table, as the script was getting bigger. I used IE to get the PUBLISHED SET because a few leads were causing encoding issues.
I searched all over for a workaround but had no luck.
However, I came across this thread with a proposed solution written in VB, but I can't figure out how to make it work within VBA.
Script I'm trying with:
Sub FetchRecords()
    Const baseUrl$ = "https://www.psacard.com"
    Const link = "https://www.psacard.com/psasetregistry/baseball/company-sets/16"
    Dim IE As New InternetExplorer, Htmldoc As HTMLDocument
    Dim Http As New XMLHTTP60, Html As New HTMLDocument, bName$, tRow As Object
    Dim post As Object, elem As Object, posts As Object, I&, R&, C&
    Dim key As Variant
    Dim idic As Object: Set idic = CreateObject("Scripting.Dictionary")

    With Http
        .Open "GET", link, False
        .send
        Html.body.innerHTML = .responseText
    End With

    Set posts = Html.querySelectorAll(".dataTable tr td a[href*='/psasetregistry/baseball/company-sets/']")
    For I = 0 To posts.Length - 7
        idic(baseUrl & Split(posts(I).getAttribute("href"), "about:")(1)) = 1
    Next I

    For Each key In idic.Keys
        With Http
            .Open "GET", key, False
            .send
            Html.body.innerHTML = .responseText
        End With
        For Each post In Html.getElementsByTagName("a")
            If InStr(post.getAttribute("title"), "Contact User") > 0 Then
                If InStr(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "publishedset") > 0 Then
                    IE.Visible = True
                    IE.navigate baseUrl & Split(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "about:")(1)
                    While IE.Busy = True Or IE.readyState < 4: DoEvents: Wend
                    Set Htmldoc = IE.document
                    bName = Htmldoc.querySelector("h1 b.text-primary").innerText
                    If InStr(bName, "/") > 0 Then bName = Split(Htmldoc.querySelector(".inline-block a[href*='/contactuser/']").innerText, " ")(1)
                    R = R + 1: Cells(R, 1) = bName
                End If
            End If
        Next post
    Next key
    IE.Quit
End Sub
I get that error, pointing at the following line, after extracting between 70 and 90 records:
bName = Htmldoc.querySelector("h1 b.text-primary").innerText
The error looks like:
Automation Error: old format or invalid type library
Proposed solution in the linked thread, written in VB (I can't convert it to VBA):
'save the current settings for easier restoration later
Dim oldCI As System.Globalization.CultureInfo = _
    System.Threading.Thread.CurrentThread.CurrentCulture

'change the settings
System.Threading.Thread.CurrentThread.CurrentCulture = _
    New System.Globalization.CultureInfo("en-US")

'Your code here

'restore the previous settings
System.Threading.Thread.CurrentThread.CurrentCulture = oldCI
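There is no direct VBA equivalent of .NET's CurrentCulture. The closest analogue I know of is the Win32 thread-locale API, which the sketch below swaps in; whether it actually clears this particular error is untested, so treat it as an assumption rather than a confirmed fix:

'Win32 declarations (64-bit VBA shown; drop PtrSafe on 32-bit Office)
Private Declare PtrSafe Function GetThreadLocale Lib "kernel32" () As Long
Private Declare PtrSafe Function SetThreadLocale Lib "kernel32" (ByVal Locale As Long) As Long

Sub RunWithEnUsLocale()
    Const LCID_EN_US As Long = &H409 '1033 = en-US

    'save the current settings for easier restoration later
    Dim oldLocale As Long
    oldLocale = GetThreadLocale()

    'change the settings
    SetThreadLocale LCID_EN_US

    'Your code here

    'restore the previous settings
    SetThreadLocale oldLocale
End Sub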

Half of the records are getting scraped out of 84

I've made a parser in VBA which is able to scrape names from Yellow Pages Canada. However, the page contains 84 names but my parser is scraping only 41 of them. How can I fix this? Any help would be a blessing. Thanks in advance. Here is the code:
http.Open "GET", "http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("listing__name--link jsListingName")
For Each topic In topics
Cells(x, 1) = topic.innerText
x = x + 1
Next topic
Btw, I used the MSxml2.xmlhttp60 request.
If you look at the page's web requests, you'll notice it'll trigger another web request once the page has been scrolled past a certain point.
The format of the new requests is like this:
First 40 records: http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton
Next 40 records: http://www.yellowpages.ca/search/si/2/Outdoor%20wedding/Edmonton
Next 40 records: http://www.yellowpages.ca/search/si/3/Outdoor%20wedding/Edmonton
Basically for new data (in batches of 40 records) it increments part of the URL by 1.
Which is good news: we can just loop to return the results. Here's the code I came up with. For whatever reason, the getElementsByClassName selector wasn't working for me, so I worked around it in my code. If you can use that selector, use it instead of what I have below for that part.
Lastly, I added an explicit reference to Microsoft XML v6.0, so you should do the same to get this to function as it is.
Option Explicit

Public Sub SOTestScraper()
    Dim topics As Object
    Dim topic As Object
    Dim webResp As Object
    Dim i As Long
    Dim j As Long
    Dim mySheet As Worksheet: Set mySheet = ThisWorkbook.Sheets("Sheet1") ' Change this
    Dim myArr() As Variant: ReDim myArr(10000) 'Probably overkill

    For i = 1 To 20 ' unsure how many records you expect, I defaulted to 20 pages, or 800 results
        Set webResp = getWebResponse(CStr(i)) ' return the web response
        If webResp Is Nothing Then Exit For 'Exit the for loop if Status 200 wasn't received
        Set topics = webResp.getElementsByTagName("*") ' I couldn't find the className so I did this instead
        For Each topic In topics
            On Error Resume Next
            'If getElementsByClassName is working for you, use it
            If topic.ClassName = "listing__name--link jsListingName" Then
                myArr(j) = topic.InnerText
                j = j + 1
            End If
        Next
    Next

    'add the data to Excel
    ReDim Preserve myArr(j - 1)
    mySheet.Range("A1:A" & j) = WorksheetFunction.Transpose(myArr)
End Sub

Function getWebResponse(ByVal pageNumber As String) As Object
    Dim http As MSXML2.ServerXMLHTTP60: Set http = New MSXML2.ServerXMLHTTP60
    Dim html As Object: Set html = CreateObject("htmlfile")

    With http
        .Open "GET", "http://www.yellowpages.ca/search/si/" & pageNumber & "/Outdoor%20wedding/Edmonton"
        .send
        .waitForResponse
        html.body.innerHTML = .responseText
        .waitForResponse
    End With

    If Not http.Status = 200 Then
        Set getWebResponse = Nothing
    Else
        Set getWebResponse = html
    End If

    Set html = Nothing
    Set http = Nothing
End Function

Unable to go deep for certain barriers in a multilayered webpage to fetch data

How do I reach the last layer of a webpage starting from the first page? I tried but got stuck: every time I ran my code to go deeper, it crawled the same page again and again. Finally, I made it work. Here is the full code.
Sub bjscrawler()
    Const url = "http://www.bjs.com"
    Dim html As New HTMLDocument, htm As New HTMLDocument
    Dim topics As Object, post As Object, topic As Object, newlinks As String
    Dim links As Object, link As Object, data As Object

    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", url, False
        .setRequestHeader "Content-Type", "text/xml"
        .send
        html.body.innerHTML = .responseText
    End With

    Set topics = html.getElementsByClassName("text")
    For Each post In topics
        Set topic = post.getElementsByTagName("a")(0)
        newlinks = url & Split(topic.href, ":")(1)
        With CreateObject("MSXML2.serverXMLHTTP")
            .Open "GET", newlinks, False
            .send
            htm.body.innerHTML = .responseText
        End With
        Set links = htm.getElementsByClassName("rightView")
        For Each link In links
            Set data = link.getElementsByTagName("h1")(0)
            x = x + 1
            Cells(x, 1) = data.innerText
        Next link
    Next post
End Sub
In the code:
For m = 0 To mla.Length - 1
    z = mla(m).getAttribute("href")
    link = pageurl & Mid(z, InStr(z, ":") + 1)
Next m
link will only contain the last url of mla; all the other ones are gone.
Also check the url you create in link: it can be invalid. As a result, the next GET will fail, but the code doesn't check that and just carries on. http.responseText will, for example, be a 404 page-not-found, the call hmm.getElementsByClassName will return an empty set, and For Each fla will be an empty loop.
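A rough sketch of both points, collecting every href instead of overwriting link and skipping requests that fail; mla, pageurl and http are the names used in the snippets above, the rest are illustrative:

Dim allLinks As Collection: Set allLinks = New Collection
Dim m As Long, z As String

'keep every link instead of overwriting the same variable
For m = 0 To mla.Length - 1
    z = mla(m).getAttribute("href")
    allLinks.Add pageurl & Mid(z, InStr(z, ":") + 1)
Next m

Dim oneLink As Variant
For Each oneLink In allLinks
    http.Open "GET", oneLink, False
    http.send
    'skip this link if the request did not succeed
    If http.Status = 200 Then
        'parse http.responseText here
    End If
Next oneLink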
In the code:
If cc <> "" Then
refinedlinks = cc
End If
validlinks = refinedlinks
Cells(x, 1) = validlinks
x = x + 1
you fill the cell even when cc is empty, which generates duplicates. Change it to:
If cc <> "" Then
Cells(x, 1) = cc
x = x + 1
End If
When you say
''' I'm stuck at this point. Not i can pull links from here nor can go
'''deeper. Because object elements are not same for all the links.
you probably want to process all the cells you just filled, not only this last validlinks. So iterate over the cells:
lastx = x
For x = 1 To lastx
    http.Open "GET", Cells(x, 1), False
    http.send
    'process http.responseText here
Next x
I am not sure what you mean by "Because object elements are not same for all the links". I hope these suggestions help you.