JSON values extraction from webpage using MSXML2.XMLHTTP - vba

Currently I am able to extract values from a webpage, but I am having trouble extracting a JSON value.
I am using the following code to extract the other values.
On Error Resume Next
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", url1234, False
http.Send
html.body.innerHTML = http.ResponseText
brand = html.body.innerText
'MsgBox (brand)
The code above does not extract the following values from this URL:
"" : {"0":"B0037RYT96","1":"B0152VYOQ2","2":"B0152WOT70","3":"B003W0NYKS","4":"B0152WOT8Y","5":"B00C2O7M1M","6":"B0037RMS6W","7":"B0037RMI0S","8":"B0152VYPXY"},

There isn't anything I can see in your code that attempts to extract this.
You could use a regex with the appropriate pattern to extract that string. Below, the string you are after is stored in the variable r.
EDIT:
Given your edit to the required string you can change the regex to:
\"dimensionToAsinMap\" :(.*)[^\r\n].*
Former answer:
Option Explicit
Public Sub GetData()
Dim s As String, r As String, re As Object
Set re = CreateObject("vbscript.regexp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.yoursite.com?tag=stackoverfl08-20", False
.send
s = .responseText
End With
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = "(""dimensionToAsinMap"" :.(.|\n)*)[;^\r\n].*return dataToReturn"
If .test(s) Then
r = .Execute(s)(0).SubMatches(0)
Else
r = "No match"
End If
End With
End Sub
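If r ends up holding only the map shown in the question, a second regex can pull out the individual values. This is a minimal sketch under that assumption (a flat map of quoted numeric keys to quoted values); ExtractMapValues is a hypothetical helper name, not part of the answer above.

```vba
Option Explicit

' Hypothetical helper: collects the values of a flat map such as
' {"0":"B0037RYT96","1":"B0152VYOQ2"} into a Collection.
Public Function ExtractMapValues(ByVal mapText As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' A quoted numeric key, a colon, then a quoted value captured in group 1.
    re.Pattern = """\d+""\s*:\s*""([^""]+)"""

    For Each m In re.Execute(mapText)
        result.Add m.SubMatches(0)
    Next m

    Set ExtractMapValues = result
End Function
```

Calling ExtractMapValues(r) then lets you loop over the collection with For Each and write each value to a cell. If r contains more than the map, the pattern may also pick up unrelated key/value pairs, so trim r first.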


VBA web page scroll

I am having trouble scrolling the document to the proper position and capturing the correct details in Excel. Here is my code; please suggest where I am going wrong.
I tried the following code but still get an error.
Public Sub GData()
'On Error Resume Next
Dim html As HTMLDocument
Dim Re, Cr, cipherDict As Object
Dim sResponse, cipherKey, Str, SG As String
Dim myArr, RsltArr(14) As Variant
Set Re = CreateObject("vbscript.regexp")
Set Cr = CreateObject("MSXML2.XMLHTTP")
Set cipherDict = CreateObject("Scripting.Dictionary")
Set html = New HTMLDocument
URL = "https://www.google.com/maps/place/Silky+Beauty+Salon/#22.2932632,70.7723656,17z/data=!3m1!4b1!4m5!3m4!1s0x3959ca1278f4820b:0x44e998d30e14a58c!8m2!3d22.2932632!4d70.7745543"
With Cr
.Open "GET", URL, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
s = .responseText
End With
With html
.body.innerHTML = sResponse
title = .querySelector("section-hero-header-title-title").innerText
phone = .querySelector("[data-item-id^=phone] [jsan*=text]").innerText
webSite = .querySelector("[aria-label^=Website] [jsan*=text]").innerText
End With
datarw = ActiveSheet.Cells(ActiveSheet.Rows.Count, "A").End(xlUp).Row + 1
ActiveSheet.Cells(datarw, 1).Value = title
ActiveSheet.Cells(datarw, 5).Value = phone
ActiveSheet.Cells(datarw, 7).Value = webSite
ActiveSheet.Cells(datarw, 1).Select
ActiveSheet.Rows(datarw).WrapText = False
End Sub
It looks like you can combine attribute selectors, using the ^= (starts with) and *= (contains) operators to match substrings of attribute values on the page, to get your target nodes. A descendant combinator (the space between the two attribute selectors) specifies the relationship between the attributes used for anchoring.
Test that the matched node Is Not Nothing before attempting to access either an attribute value or .innerText.
Dim phone as Object, webSite As Object, title As Object
Set title = ie.document.querySelector(".section-hero-header-title-title")
Set phone = ie.document.querySelector("[data-item-id^=phone] [jsan*=text]")
Set website = ie.document.querySelector("[aria-label^=Website] [jsan*=text]")
If Not phone Is Nothing Then
'clean phone.innerText as appropriate
End If
If Not website Is Nothing Then
'clean website.innerText as appropriate
End If
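The Nothing checks can be folded into one small function so each field needs only a single line. A minimal sketch; TextOrEmpty is a hypothetical helper name, and the trimming is an assumption about what "clean" means here.

```vba
Option Explicit

' Hypothetical helper: returns the trimmed innerText of a matched node,
' or an empty string when the selector matched nothing.
Public Function TextOrEmpty(ByVal node As Object) As String
    If node Is Nothing Then
        TextOrEmpty = vbNullString
    Else
        TextOrEmpty = Trim$(node.innerText)
    End If
End Function
```

With this in place, ActiveSheet.Cells(datarw, 5).Value = TextOrEmpty(phone) writes an empty cell instead of raising an error when the phone node is missing.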
To get the appropriate protocol for the website address, if missing, you can use the cleaned website address you have in a regex to pull the protocol from earlier in the html where it sits in a script tag.
Read about
css selectors: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
querySelector: querySelector and querySelectorAll vs getElementsByClassName and getElementById in JavaScript

Unable to use querySelector within querySelectorAll container in the right way

I'm trying to figure out how I can use .querySelector() on .querySelectorAll().
For example, I get expected results when I try like this:
Sub GetContent()
Const URL$ = "https://stackoverflow.com/questions/tagged/web-scraping?tab=Newest"
Dim HTMLDoc As New HTMLDocument
Dim HTML As New HTMLDocument, R&, I&
With New XMLHTTP60
.Open "Get", URL, False
.send
HTMLDoc.body.innerHTML = .responseText
End With
With HTMLDoc.querySelectorAll(".summary")
For I = 0 To .Length - 1
HTML.body.innerHTML = .Item(I).outerHTML
R = R + 1: Cells(R, 1).Value = HTML.querySelector(".question-hyperlink").innerText
Next I
End With
End Sub
The script doesn't work anymore when I pick another site in order to grab the values under Rank column available in the table even when I use the same logic:
Sub GetContent()
Const URL$ = "https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/men/senior/2020?page=1"
Dim HTMLDoc As New HTMLDocument
Dim HTML As New HTMLDocument, R&, I&
With New XMLHTTP60
.Open "Get", URL, False
.send
HTMLDoc.body.innerHTML = .responseText
End With
With HTMLDoc.querySelectorAll("#toplists tbody tr")
For I = 0 To .Length - 1
HTML.body.innerHTML = .Item(I).outerHTML
R = R + 1: Cells(R, 1).Value = HTML.querySelector("td").innerText
Next I
End With
End Sub
The line in question, in both scripts, is Cells(R, 1).Value = HTML.querySelector().innerText. I'm using it the same way within the .querySelectorAll() container.
If I use .querySelector() on .getElementsByTagName(), it works. I also had success using TagName on TagName, ClassName on ClassName, etc. So I can grab the content in a few different ways.
How can I use .querySelector() on .querySelectorAll() in the second script in order for it to work?
Wrap it in table tags so the html parser knows what to do with it.
HTML.body.innerHTML = "<table>" & .Item(I).outerHTML & "</table>"
Doing so preserves the structure of the opening td tag which is otherwise stripped of the "<".
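The wrapping step can be isolated in a small function, which also guards against a row that has no td. A minimal sketch, assuming the same reference to the Microsoft HTML Object Library that the question's code already uses; FirstCellText is a hypothetical helper name.

```vba
Option Explicit

' Requires a reference to the Microsoft HTML Object Library (Tools > References).
' Hypothetical helper: returns the text of the first td in one tr fragment.
Public Function FirstCellText(ByVal rowHtml As String) As String
    Dim fragment As New HTMLDocument
    Dim cell As Object

    ' Wrap the row so it is parsed in a table context.
    fragment.body.innerHTML = "<table>" & rowHtml & "</table>"

    Set cell = fragment.querySelector("td")
    If Not cell Is Nothing Then FirstCellText = Trim$(cell.innerText)
End Function
```

Inside the loop this would be used as Cells(R, 1).Value = FirstCellText(.Item(I).outerHTML).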

How to handle errors efficiently to prevent misleading results?

I've written some VBA code to find certain identities for some names on some websites. The code works well when everything is in order, that is, when the link is valid, the name matches one of the a tags, and the regex finds the identity. If any of the three fails, the script throws an error. I've marked the positions where the errors occur in the script below.
I'm hoping for a solution that handles the errors and lets the script continue until all the links are exhausted.
As I don't have much knowledge of VBA, I tried On Error Resume Next to skip the errors. However, the results turn out to be a mess. Below is a rough example of what I get when I use On Error Resume Next.
Sub Identity_Finder()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object, link As Variant, refined_links As String
Dim rxp As New RegExp, identity As Object
For Each link In [{"http://spltech.in/","http://www.unifrostindia.com/","http://www.unitfrostindia.com/","http://www.greenplanet.in/"}]
With http
.Open "GET", link, False
.send '''throws here the first error if the link is invalid
html.body.innerHTML = .responseText
End With
For Each post In html.getElementsByTagName("a")
If InStr(post.innerText, "certain_name") > 0 Then refined_links = post.href: Exit For
Next post
With http
.Open "GET", refined_links, False
.send ''throws another error here if no such link is found
End With
With rxp
.Pattern = "some_regex"
.Global = True
Set identity = .Execute(http.responseText)
End With
r = r + 1: Cells(r, 1) = link
Cells(r, 2) = identity(0) ''''throws another error here if no such identity is noticed
Next link
End Sub
What I get when using On Error Resume Next:
John executive
Mac lawyer
lulu lawyer
Robin lawyer
Cathy student
Expected output:
John executive
Mac lawyer
lulu
Robin
Cathy student
When I use On Error Resume Next, the empty fields (those where nothing was found) get filled in with the previous values. How can I get around this misleading result? Thanks in advance.
The most efficient way to error trap in VBA is to
1) actually test the inputs / results before running them either through custom-made functions or built-in coding concepts or a mix of both.
2) Use VBA built-in error-handling if absolutely needed
Example 1
For example, you can guard this statement with a custom function that tests whether a URL is valid.
With http
.Open "GET", link, False
.send '''throws here the first error if the link is invalid
html.body.innerHTML = .responseText
End With
becomes:
If ValidURL(CStr(link)) Then
With http
.Open "GET", link, False
.send
html.body.innerHTML = .responseText
End With
End If
Where ValidURL is a function defined as:
Function ValidURL(ByVal URL As String) As Boolean
Dim result As Boolean
'I don't know how you would specify a valid link in your specific case
'but that code goes here
'a dummy example follows
result = Left(URL,7) = "http://"
ValidURL = result 'True or False
End Function
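A slightly less dummy version of the same idea might accept both http and https and ignore case and surrounding spaces. This is only a sketch: LooksLikeHttpUrl is a hypothetical name, and it checks the shape of the string only, not whether the site is reachable.

```vba
Option Explicit

' Hypothetical helper: True when the text starts with http:// or https://
' and has at least one character after the scheme.
Public Function LooksLikeHttpUrl(ByVal url As String) As Boolean
    Dim lowered As String
    lowered = LCase$(Trim$(url))

    LooksLikeHttpUrl = _
        (Left$(lowered, 7) = "http://" And Len(lowered) > 7) Or _
        (Left$(lowered, 8) = "https://" And Len(lowered) > 8)
End Function
```

Because the check is purely syntactic, a request to a well-formed but dead link can still fail, so the error handling in the next example is still needed.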
Example 2
I assume in this statement:
With http
.Open "GET", refined_links, False
.send ''throws another error here if no such link is found
End With
there is a specific error number (code) that is produced when no such link is found. Discover that number and use this code to bypass.
Dim errNum As Long
With http
.Open "GET", refined_links, False
On Error Resume Next
.Send
errNum = Err.Number 'capture before On Error GoTo 0, which resets Err
On Error GoTo 0
End With
If errNum <> 9999 Then 'replace with correct number
'continue with regex test
End If
PUTTING IT ALL TOGETHER
Finally putting that all together you can build like so, with minimal use of On Error Resume Next and no GoTo statements.
For Each link In [{"http://spltech.in/","http://www.unifrostindia.com/","http://www.unitfrostindia.com/","http://www.greenplanet.in/"}]
If ValidURL(CStr(link)) Then
With http
.Open "GET", link, False
.send
html.body.innerHTML = .responseText
End With
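Dim errNum As Long
refined_links = vbNullString 'reset so a link found for a previous site is not reused
For Each post In html.getElementsByTagName("a")
If InStr(post.innerText, "certain_name") > 0 Then refined_links = post.href: Exit For
Next post
With http
On Error Resume Next 'also covers .Open failing on an empty refined_links
.Open "GET", refined_links, False
.Send
errNum = Err.Number 'capture before On Error GoTo 0, which resets Err
On Error GoTo 0
End With
If errNum <> 9999 Then 'replace with correct number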
With rxp
.Pattern = "some_regex"
.Global = True
Set identity = .Execute(http.responseText)
End With
'i will leave it to you on how to account for no pattern match
r = r + 1: Cells(r, 1) = link
Cells(r, 2) = identity(0) ''''throws another error here if no such identity is noticed
End If
End If
Next link
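One way to account for a missing pattern match, which the code above leaves open, is to test the match count before reading the first match and fall back to an empty string. A minimal sketch; FirstMatchOrEmpty is a hypothetical helper name, and it uses late binding so it does not depend on the RegExp reference.

```vba
Option Explicit

' Hypothetical helper: returns the first regex match in source,
' or an empty string when the pattern does not match.
Public Function FirstMatchOrEmpty(ByVal source As String, ByVal pattern As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = pattern

    Set matches = re.Execute(source)
    If matches.Count > 0 Then
        FirstMatchOrEmpty = matches(0).Value
    Else
        FirstMatchOrEmpty = vbNullString
    End If
End Function
```

Writing Cells(r, 2) = FirstMatchOrEmpty(http.responseText, "some_regex") then leaves the cell empty for a site without an identity instead of raising an error or repeating the previous value.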

VBA Code from NULL #VALUE! to Zero

I am using the below code to retrieve some data from websites.
Public Function giveMeValue(ByVal link As String) As String
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "POST", link, False
.send
htm.body.innerhtml = .responsetext
End With
With htm.getelementbyid("JS_topStoreCount")
giveMeValue = .innerText
End With
htm.Close
Set htm = Nothing
End Function
Sometimes the element with ID "JS_topStoreCount" doesn't exist and the function returns #VALUE!. How do I modify this function so that errors are returned as 0 and are highlighted in red?
I couldn't see the reason for the Do Loop so I have removed it, I've added an if statement to check if the html element is nothing before assigning it to the return value.
Public Function giveMeValue(ByVal link As String) As String
Dim htm As Object, el As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", link, False
.send
htm.body.innerhtml = .responsetext
End With
'getElementById returns Nothing when the element is missing
Set el = htm.getElementById("JS_topStoreCount")
If Not el Is Nothing Then
giveMeValue = el.innerText
Else
giveMeValue = "0"
End If
htm.Close
Set htm = Nothing
End Function
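The function above does not cover the red highlighting. As far as I know, a function called from a worksheet cell cannot change cell formatting, so that part needs a separate macro (or a conditional formatting rule). A minimal sketch of the macro route; HighlightZeros and the range passed to it are assumptions, not part of the original code.

```vba
Option Explicit

' Hypothetical helper: colours the font red in every cell of target
' whose value is 0 (number or text), and skips error cells.
Public Sub HighlightZeros(ByVal target As Range)
    Dim cell As Range

    For Each cell In target.Cells
        If Not IsError(cell.Value) Then
            If CStr(cell.Value) = "0" Then cell.Font.Color = vbRed
        End If
    Next cell
End Sub
```

For example, HighlightZeros ActiveSheet.Range("B2:B100") could be run after the sheet recalculates, where B2:B100 stands for whichever cells hold the giveMeValue formulas.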

XMLHTTP.send request brings back "Nothing"

I have a spreadsheet that has hundreds of links that point to a server (with authentication) that can be accessed via the web. I've been searching for a solution to a Link Checker in a spreadsheet that would tell me which links are broken and which are ok. By broken I mean that the website does not get called up at all.
There are various solutions I have found around the web, none of which work for me. I'm boggled by this...
One example that I've tried to use and figure out is re-posted below.
As I've stepped through the code, I have come to realize that the oHTTP.send request brings back "Nothing". It does so for all links in the spreadsheet, regardless of whether the link works, or not.
Public Function CheckHyperlink(ByVal strUrl As String) As Boolean
Dim oHttp As New MSXML2.XMLHTTP30
On Error GoTo ErrorHandler
oHttp.Open "HEAD", strUrl, False
oHttp.send
If Not oHttp.Status = 200 Then CheckHyperlink = False Else CheckHyperlink = True
Exit Function
ErrorHandler:
CheckHyperlink = False
End Function
Any suggestions as to what might be wrong, or right, is highly appreciated!
A couple of possible causes:
Do you mean oHttp.Open "GET", strUrl, False instead of oHttp.Open "HEAD", strUrl, False?
Perhaps MSXML2.XMLHTTP30 is not available? You can declare an instance of MSXML2.XMLHTTPX as either early bound or late bound, which may affect which version you want to use versus what is available (example: http://word.mvps.org/FAQs/InterDev/EarlyvsLateBinding.htm).
For example:
Option Explicit
'Dim oHTTPEB As New XMLHTTP30 'For early binding enable reference Microsoft XML, v3.0
Dim oHTTPEB As New XMLHTTP60 'For early binding enable reference Microsoft XML, v6.0
Sub Test()
Dim chk1 As Boolean
Dim chk2 As Boolean
chk1 = CheckHyperlinkLB("http://stackoverflow.com/questions/11647297/xmlhttp-send-request-brings-back-nothing")
chk2 = CheckHyperlinkEB("http://stackoverflow.com/questions/11647297/xmlhttp-send-request-brings-back-nothing")
End Sub
Public Function CheckHyperlinkLB(ByVal strUrl As String) As Boolean
Dim oHTTPLB As Object
'late bound declaration of MSXML2.XMLHTTP30
Set oHTTPLB = CreateObject("Msxml2.XMLHTTP.3.0")
On Error GoTo ErrorHandler
oHTTPLB.Open "GET", strUrl, False
oHTTPLB.send
If Not oHTTPLB.Status = 200 Then CheckHyperlinkLB = False Else CheckHyperlinkLB = True
Set oHTTPLB = Nothing
Exit Function
ErrorHandler:
Set oHTTPLB = Nothing
CheckHyperlinkLB = False
End Function
Public Function CheckHyperlinkEB(ByVal strUrl As String) As Boolean
'early bound declaration of MSXML2.XMLHTTP60
On Error GoTo ErrorHandler
oHTTPEB.Open "GET", strUrl, False
oHTTPEB.send
If Not oHTTPEB.Status = 200 Then CheckHyperlinkEB = False Else CheckHyperlinkEB = True
Set oHTTPEB = Nothing
Exit Function
ErrorHandler:
Set oHTTPEB = Nothing
CheckHyperlinkEB = False
End Function
EDIT:
I tested the OP's link by opening it in a browser and have now discovered that it redirects to the login page, so I was actually testing a different link. The request is probably failing because the oHttp object has not been set to allow redirects. I know it's possible to enable redirects for WinHttp.WinHttpRequest.5.1 using the code below. I would need to investigate whether this also works for MSXML2.XMLHTTP30, though.
Option Explicit
Sub Test()
Dim chk1 As Boolean
chk1 = CheckHyperlink("http://portal.emilfrey.ch/portal/page/portal/toyota/30_after_sales/20_ersatzteile%20und%20zubeh%C3%B6r/10_zubeh%C3%B6r/10_produktbezogene%20informationen/10_aussen/10_felgen/10_asa-pr%C3%BCfberichte/iq/tab1357333/iq%20016660.pdf")
End Sub
Public Function CheckHyperlink(ByVal strUrl As String) As Boolean
Dim GetHeader As String
Const WinHttpRequestOption_EnableRedirects = 6
Dim oHttp As Object
Set oHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
On Error GoTo ErrorHandler
oHttp.Option(WinHttpRequestOption_EnableRedirects) = True
oHttp.Open "HEAD", strUrl, False
oHttp.send
If Not oHttp.Status = 200 Then
CheckHyperlink = False
Else
GetHeader = oHttp.getAllResponseHeaders()
CheckHyperlink = True
End If
Exit Function
ErrorHandler:
CheckHyperlink = False
End Function
EDIT2:
MSXML2.XMLHTTP does allow redirects (although I believe MSXML2.ServerXMLHTTP doesn't). The redirects are allowed/disallowed depending upon whether the redirect is cross-domain, cross-port etc (see details here http://msdn.microsoft.com/en-us/library/ms537505(v=vs.85).aspx)
Since the redirect to the login page is cross-domain then IE zone policy is implemented. Open IE/Tools/Internet Options/Security/Custom Level and change 'Access data sources across domains' to ENABLED
The OP's original code will now redirect properly.
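Since the question is about hundreds of links in a spreadsheet, here is a rough sketch of running a check over every cell hyperlink on a sheet. LinkStatus and CheckAllLinks are hypothetical names; the sketch assumes the links are cell hyperlinks (not shapes) and that the column to the right of each link is free for output.

```vba
Option Explicit

' Hypothetical helper: returns the HTTP status of a HEAD request,
' or 0 when the request itself fails (bad URL, no connection, and so on).
Public Function LinkStatus(ByVal url As String) As Long
    Dim req As Object

    On Error GoTo Failed
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
    req.Open "HEAD", url, False
    req.Send
    LinkStatus = req.Status
    Exit Function
Failed:
    LinkStatus = 0
End Function

' Writes the status next to every cell hyperlink on the active sheet.
Public Sub CheckAllLinks()
    Dim link As Hyperlink

    For Each link In ActiveSheet.Hyperlinks
        ' Assumes the cell to the right of each link can be overwritten.
        link.Range.Offset(0, 1).Value = LinkStatus(link.Address)
    Next link
End Sub
```

Writing the raw status rather than True/False makes it easier to tell a missing page from a redirect to the login page or an authentication failure.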