Can't use querySelector in the right way in vba - vba

I've written some code using vba to get all the movie names from a specific webpage out of a torrent site. However, pressing "F8" I could find out that the code works well and prints the results until it hits the last result from that page. As soon as it reaches the last name to parse, the program crashes. I did several times and suffered the same consequences. If vba doesn't support this css selector method then how could I collect results before the last one? Is there any reference to add in the library or something else before execution? Any help on this will be vastly appreciated.
Here is the code I have written:
Sub Torrent_data()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim movie_name As Object, movie As Object
With http
.Open "GET", "https://www.yify-torrent.org/search/1080p/", False
.send
html.body.innerHTML = .responseText
End With
Set movie_name = html.querySelectorAll("div.mv h3 a")
For Each movie In movie_name
x = x + 1: Cells(x, 1) = movie.innerText
Next movie
End Sub

Try this:
Sub Torrent_data()
Dim http As New XMLHTTP60, html As New HTMLDocument, x As Long
With http
.Open "GET", "https://www.yify-torrent.org/search/1080p/", False
.send
html.body.innerHTML = .responseText
End With
Do
x = x + 1
On Error Resume Next
Cells(x, 1) = html.querySelectorAll("div.mv h3 a")(x - 1).innerText
Loop Until Err.Number = 91
End Sub
This is another way which doesn't need error handler:
Sub GetContent()
Const URL$ = "https://yify-torrent.cc/search/1080p/"
Dim HTMLDoc As New HTMLDocument, R&, I&
With New ServerXMLHTTP60
.Open "Get", URL, False
.send
HTMLDoc.body.innerHTML = .responseText
End With
With HTMLDoc.querySelectorAll("h3 > a.movielink")
For I = 0 To .Length - 1
R = R + 1: Cells(R, 1).Value = .Item(I).innerText
Next I
End With
End Sub

the code retrieves one element after the last movie
this extra element causes the failure, so for each ... cannot be used
not sure why ... yet .... will update
Sub Torrent_data()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim movie_name As Object, movie As Object
With http
.Open "GET", "https://www.yify-torrent.org/search/1080p/", False
.send
html.body.innerHTML = .responseText
End With
Set movie_name = html.querySelectorAll("div.mv h3 a")
Dim i As Integer
For i = 0 To movie_name.Length - 1
Cells(x + i, 1) = movie_name(i).innerText
Next i
End Sub

looks like querySelectorAll has an issue of some sort
the object html.querySelectorAll(".mv h3 a") cannot be examined in Watch window.
attempting to do so crashes excel or word (i tried both)
tried other tags, same result
Sub Torrent_data()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim movie_name As Object, movie As Object
With http
.Open "GET", "https://www.yify-torrent.org/search/1080p/", False
.send
html.body.innerHTML = .responseText
End With
' Set movie_name = html.querySelectorAll("div.mv h3 a") ' querySelectorAll crashes VBA when trying to examine movie_name object
Set movie_name = html.getElementsByClassName("mv") ' HTMLElementCollection
For Each movie In movie_name
x = x + 1: Cells(x, 1) = movie.getElementsByTagName("a")(1).innerText
Next movie
' HTML block for each movie looks like this
' <div class="mv">
' <h3>
' Smoke (1995) 1080p
' </h3>
' <div class="movie">
' <div class="movie-image">
' <a href="/movie/55346/download-smoke-1995-1080p-mp4-yify-torrent.html" target="_blank" title="Download Smoke (1995) 1080p">
' <span class="play"><span class="name">Smoke (1995) 1080p</span></span>
' <img src="//pic.yify-torrent.org/20170820/55346/smoke-1995-1080p-poster.jpg" alt="Smoke (1995) 1080p" />
' </a>
' </div>
' </div>
' <div class="mdif">
' <ul>
' <li><b>Genre:</b>Comedy</li><li><b>Quality:</b>1080p</li><li><b>Screen:</b>1920x1040</li><li><b>Size:</b>2.14G</li><li><b>Rating:</b>7.4/10</li><li><b>Peers:</b>2</li><li><b>Seeds:</b>0</li>
' </ul>
' Download
' </div>
' </div>
End Sub

I know this old, but I managed on how to use querySelectorAll without crashes my IE.
Instead of using For-each I used For Loop
Example below:
Dim priceData as Object
Set priceData = IE.document.getElementsByClassName("list-flights")(0).querySelectorAll("[class$='price']")
For i = 0 to priceData.Length - 1
Debug.Print priceData.item(i).getElementsByClassName("cash js_linkInsideCell")(0).innerHTML
Next i

Related

getElementsByClassName() failed in my VBA

contents in Webpage as follow:
<pre>
<div class="catdivlogo">
<img src="//static.designandreuse.com/sip/logo/rambus2.gif" border="0" class="catlogo" alt="Rambus, Inc.">
</div>
</pre>
I failed to get the middle line by .getElementsByClASSnAME("catlogo"). I can only get the first line by .getElementsByClASSnAME("catlogo").
Anybody can help this?
My VBA as follow:
Sub Test()
Const URL As String = "https://www.design-reuse.com/sip/pci-express-c-430/"
Const CLASS_NAME As String = "catlogo"
Dim oDoc As MSHTML.HTMLDocument
Dim results As Object
With CreateObject("MSXML2.XMLHttp")
.Open "GET", URL, False
.send
'Check the response is valid
If .Status = 200 Then
Set oDoc = CreateObject("htmlfile")
oDoc.body.innerHTML = .responseText
Set results = oDoc.getElementsByClassName(CLASS_NAME)
End If
End With
End Sub

Unable to use querySelector within querySelectorAll container in the right way

I'm trying to figure out how I can use .querySelector() on .querySelectorAll().
For example, I get expected results when I try like this:
Sub GetContent()
Const URL$ = "https://stackoverflow.com/questions/tagged/web-scraping?tab=Newest"
Dim HTMLDoc As New HTMLDocument
Dim HTML As New HTMLDocument, R&, I&
With New XMLHTTP60
.Open "Get", URL, False
.send
HTMLDoc.body.innerHTML = .responseText
End With
With HTMLDoc.querySelectorAll(".summary")
For I = 0 To .Length - 1
HTML.body.innerHTML = .Item(I).outerHTML
R = R + 1: Cells(R, 1).Value = HTML.querySelector(".question-hyperlink").innerText
Next I
End With
End Sub
The script doesn't work anymore when I pick another site in order to grab the values under Rank column available in the table even when I use the same logic:
Sub GetContent()
Const URL$ = "https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/men/senior/2020?page=1"
Dim HTMLDoc As New HTMLDocument
Dim HTML As New HTMLDocument, R&, I&
With New XMLHTTP60
.Open "Get", URL, False
.send
HTMLDoc.body.innerHTML = .responseText
End With
With HTMLDoc.querySelectorAll("#toplists tbody tr")
For I = 0 To .Length - 1
HTML.body.innerHTML = .Item(I).outerHTML
R = R + 1: Cells(R, 1).Value = HTML.querySelector("td").innerText
Next I
End With
End Sub
This is the line Cells(R, 1).Value = HTML.querySelector().innerText In both the script I'm talking about. I'm using the same within this container .querySelectorAll().
If I use .querySelector() on .getElementsByTagName(), I found it working. I also found success using TagName on TagName or ClassName on ClassName e.t.c. So, I can grab the content in few different ways.
How can I use .querySelector() on .querySelectorAll() in the second script in order for it to work?
Wrap it in table tags so the html parser knows what to do with it.
HTML.body.innerHTML = "<table>" & .Item(I).outerHTML & "</table>"
Doing so preserves the structure of the opening td tag which is otherwise stripped of the "<".

Error when changing IE automation code to XML

I recently started working with XML automation and after changing some basic IE automation code over, I seem to be getting an error. Here's the HTML:
<tbody>
<tr class="group-2 first">
<td class="date-col">
<a href="/stats/matches/mapstatsid/48606/teamone-vs-merciless">
<div class="time" data-time-format="d/M/yy" data-unix="1498593600000">27/6/17</div>
</a>
</td>
......SOME MORE HTML HERE......
</tr>
......SOME MORE HTML HERE......
</tbody>
And here's the code i'm using in Excel VBA:
Sub readData()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim html As New MSHTML.HTMLDocument
XMLPage.Open "GET", "https://www.hltv.org/stats/matches", False
XMLPage.send
If XMLPage.Status <> 200 Then MsgBox XMLPage.statusText
html.body.innerHTML = XMLPage.responseText
For Each profile In html.getElementsByTagName("tbody")(0).Children
Debug.Print profile.getElementsByClassName("date-col")(0).getElementsByTagName("a")(0).getAttribute("href") 'Run time error '438' here
Next
End Sub
I'm getting the Run time error '438' at the debug print code. seems to be happening when getting the class but i'm unsure why. It works fine if I use this for example:
Debug.Print profile.innertext
Worked for me:
Sub readData()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim html As New MSHTML.HTMLDocument, links, a, i
XMLPage.Open "GET", "https://www.hltv.org/stats/matches", False
XMLPage.send
If XMLPage.Status <> 200 Then MsgBox XMLPage.statusText
html.body.innerHTML = XMLPage.responseText
Set links = html.querySelectorAll("td.date-col > a")
Debug.Print links.Length
For i = 0 To links.Length - 1
Debug.Print links(i).href
Next
Set links = Nothing
Set html = Nothing
End Sub
FYI when I used For Each to loop over the links collection Excel would reliably crash, so I'd stay with the loop shown
profile refers to a row, and profile.cells(0) will refer to the first column in that row. So try...
profile.cells(0).getElementsByTagName("a")(0).getAttribute("href")
Also, profile should be declared as HTMLTableRow.
The URL you are using isn't serving valid XML, but it's recoverable with some simple regex replacements. Once we have some valid XML, we can load that into a DOM document and use XPath to select the nodes as required:
Option Explicit
'Add references to:
' - MSXML v3
' - Microsoft VBScript Regular Expressions 5.5
Sub test()
Const START_MARKER As String = "<table class=""stats-table matches-table"">"
Const END_MARKER As String = "</table>"
With New MSXML2.XMLHTTP
.Open "GET", "https://www.hltv.org/stats/matches", False
.send
If .Status = 200 Then
'The HTML isn't valid XHTML, so we can't just use the http.XMLResponse DOMDocument
'Let's extract the HTML table
Dim tableStart As Long
tableStart = InStr(.responseText, START_MARKER)
Dim tableEnd As Long
tableEnd = InStr(tableStart, .responseText, END_MARKER)
Dim tableHTML As String
tableHTML = Mid$(.responseText, tableStart, tableEnd - tableStart + Len(END_MARKER))
'The HTML table has invalid img tags (let's add a closing tag with some regex)
With New RegExp
.Global = True
.Pattern = "(\<img [\W\w]*?)"">"
Dim tableXML As String
tableXML = .Replace(tableHTML, "$1"" />")
End With
'And load an XML document from the cleaned up HTML fragment
Dim doc As MSXML2.DOMDocument
Set doc = New MSXML2.DOMDocument
doc.LoadXML tableXML
End If
End With
If Not doc Is Nothing Then
'Use XPath to select the nodes we need
Dim nodes As MSXML2.IXMLDOMSelection
Set nodes = doc.SelectNodes("//td[#class='date-col']/a/#href")
'Enumerate the URLs
Dim node As IXMLDOMAttribute
For Each node In nodes
Debug.Print node.nodeTypedValue
Next node
End If
End Sub
Output:
/stats/matches/mapstatsid/48606/teamone-vs-merciless
/stats/matches/mapstatsid/48607/merciless-vs-teamone
/stats/matches/mapstatsid/48608/merciless-vs-teamone
/stats/matches/mapstatsid/48600/wysix-vs-fnatic-academy
/stats/matches/mapstatsid/48602/skitlite-vs-nexus
/stats/matches/mapstatsid/48604/extatus-vs-forcebuy
/stats/matches/mapstatsid/48605/extatus-vs-forcebuy
/stats/matches/mapstatsid/48599/planetkey-vs-gatekeepers
/stats/matches/mapstatsid/48603/gatekeepers-vs-planetkey
/stats/matches/mapstatsid/48595/wysix-vs-gambit
/stats/matches/mapstatsid/48596/kinguin-vs-playing-ducks
/stats/matches/mapstatsid/48597/spirit-academy-vs-tgfirestorm
/stats/matches/mapstatsid/48601/spirit-academy-vs-tgfirestorm
/stats/matches/mapstatsid/48593/fnatic-academy-vs-gambit
/stats/matches/mapstatsid/48594/alternate-attax-vs-nexus
/stats/matches/mapstatsid/48590/pro100-vs-playing-ducks
/stats/matches/mapstatsid/48583/extatus-vs-ex-fury
/stats/matches/mapstatsid/48589/extatus-vs-ex-fury
/stats/matches/mapstatsid/48584/onlinerol-vs-forcebuy
/stats/matches/mapstatsid/48591/forcebuy-vs-onlinerol
/stats/matches/mapstatsid/48581/epg-vs-veni-vidi-vici
/stats/matches/mapstatsid/48588/epg-vs-veni-vidi-vici
/stats/matches/mapstatsid/48592/veni-vidi-vici-vs-epg
/stats/matches/mapstatsid/48582/log-vs-gatekeepers
/stats/matches/mapstatsid/48586/gatekeepers-vs-log
/stats/matches/mapstatsid/48580/spraynpray-vs-epg
/stats/matches/mapstatsid/48579/quantum-bellator-fire-vs-spraynpray
/stats/matches/mapstatsid/48571/noxide-vs-masterminds
/stats/matches/mapstatsid/48572/athletico-vs-legacy
/stats/matches/mapstatsid/48578/node-vs-avant
/stats/matches/mapstatsid/48573/funky-monkeys-vs-grayhound
/stats/matches/mapstatsid/48574/grayhound-vs-funky-monkeys
/stats/matches/mapstatsid/48575/hegemonyperson-vs-eclipseo
/stats/matches/mapstatsid/48577/eclipseo-vs-hegemonyperson
/stats/matches/mapstatsid/48566/masterminds-vs-tainted-black
/stats/matches/mapstatsid/48562/grayhound-vs-legacy
/stats/matches/mapstatsid/48563/noxide-vs-riotous-raccoons
/stats/matches/mapstatsid/48564/avant-vs-dark-sided
/stats/matches/mapstatsid/48565/avant-vs-dark-sided
/stats/matches/mapstatsid/48567/eclipseo-vs-uya
/stats/matches/mapstatsid/48568/uya-vs-eclipseo
/stats/matches/mapstatsid/48560/uya-vs-new4
/stats/matches/mapstatsid/48561/new4-vs-uya
/stats/matches/mapstatsid/48559/jaguar-sa-vs-miami-flamingos
/stats/matches/mapstatsid/48558/spartak-vs-binary-dragons
/stats/matches/mapstatsid/48557/kungar-vs-spartak
/stats/matches/mapstatsid/48556/igamecom-vs-fragsters
/stats/matches/mapstatsid/48554/nordic-warthogs-vs-aligon
/stats/matches/mapstatsid/48555/binary-dragons-vs-kungar
/stats/matches/mapstatsid/48550/havu-vs-rogue-academy
Looking at the MSHTML.HTMLDocument reference there is no method getElementsByClassName.
You will need to loop through each row in the tbody you are selecting and then get the first td in that row and then get the first link in that td and read the href attribute from it. You could alternately compare the class attribute of the td but since it is the first element in the row there is no need to do that.

Unable to go deep for certain barriers in a multilayered webpage to fetch data

How to reach the last layer of a webpage starting from the first page? I tried but got stuck. Every time I run my code to go deep it crawls the same page again and again. Finally , I made it. Here is the full code.
Sub bjscrawler()
Const url = "http://www.bjs.com"
Dim html As New HTMLDocument, htm As New HTMLDocument
Dim topics As Object, post As Object, topic As Object, newlinks As String
Dim links As Object, link As Object, data As Object
With CreateObject("MSXML2.serverXMLHTTP")
.Open "GET", url, False
.setRequestHeader "Content-Type", "text/xml"
.send
html.body.innerHTML = .responseText
End With
Set topics = html.getElementsByClassName("text")
For Each post In topics
Set topic = post.getElementsByTagName("a")(0)
newlinks = url & Split(topic.href, ":")(1)
With CreateObject("MSXML2.serverXMLHTTP")
.Open "GET", newlinks, False
.send
htm.body.innerHTML = .responseText
End With
Set links = htm.getElementsByClassName("rightView")
For Each link In links
Set data = link.getElementsByTagName("h1")(0)
x = x + 1
Cells(x, 1) = data.innerText
Next link
Next post
End Sub
In the code:
For m = 0 To mla.Length - 1
z = mla(m).getAttribute("href")
link = pageurl & Mid(z, InStr(z, ":") + 1)
Next m
link will only contain the last url of mla. All the other ones are "gone".
Also check the url you created in link, it can be invalid. As a result, the next GET wil fail, but the code doesn't check that and just "carries on". http.responseText will for example be 404 page not found, the call hmm.getElementsByClassName will return an empty set and For Each fla will be an emty loop.
In the code:
If cc <> "" Then
refinedlinks = cc
End If
validlinks = refinedlinks
Cells(x, 1) = validlinks
x = x + 1
you fill the cell also when cc was empty, which generates duplicates. Change to:
If cc <> "" Then
Cells(x, 1) = cc
x = x + 1
End If
When you say
''' I'm stuck at this point. Not i can pull links from here nor can go
'''deeper. Because object elements are not same for all the links.
you probably want to process all the cells you just filled, not only this last validlinks. So iterate over the cells:
lastx= x
For x= 1 to lastx
http.Open "GET", Cells(x, 1), False
I am not sure what you mean with "Because object elements are not same for all the links". I hope these suggestions help you.

Retrieving data from the web using vba

Just started using html, reasonably capable in vba but having some problems linking the two.
I have passed a registration to a web site and trying to get the results.
code used so far
Dim HTMLDoc As HTMLDocument
Dim MyBrowser As InternetExplorer
Sub GetVehicleDetails()
Dim MyHTML_Element As IHTMLElement
Dim MyURL As String
Dim x As Integer
On Error GoTo Err_Clear
MyURL = "http://www.1stchoice.co.uk/find-a-part"
x = 0
Set MyBrowser = New InternetExplorer
MyBrowser.Silent = True
MyBrowser.navigate MyURL
MyBrowser.Visible = True
Do
Loop Until MyBrowser.readyState = READYSTATE_COMPLETE
Set HTMLDoc = MyBrowser.document
HTMLDoc.all.license_plate.Value = "LV11VYT"
For Each MyHTML_Element In HTMLDoc.getElementsByTagName("button") '("input")
'Get 2nd button
If MyHTML_Element.Title = "Continue" Then 'MyHTML_Element.Click: Exit For
x = x + 1
If x = 2 Then
MyHTML_Element.Click
End If
End If
Next
Err_Clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
End Sub
Now I need to wait until page is refreshed and then get the result, but I'm not certain how to pull the result out
Source code is
<div id="block_subheader" class="block_editable block_wysiwyg">
<p>Almost there! <strong>TELL US</strong> which parts you need - <strong>ADD </strong>your contact details & receive <strong>No Obligation Quotes</strong><span style="font-weight: normal;"> to compare & </span><span style="font-weight: normal;"><strong>Save ££'s!</strong></span></p>
</div>
<div class="clear"></div>
<form id="step3" action="/find-a-part/step-3" method="post" enctype="multipart/form-data">
<div class="clearfix">
<h2>RENAULT MEGANE (X95) DYNAMIQUE TOMTOM DCI ECO 3 DOOR COUPE 1461cc (2011) DIESEL</h2>
<p>Not quite the vehicle you're searching for? Click here to specify the vehicle exactly</p>
</div>
Trying to get the Renault Megane details
Can anyone help please?
OK I have got past this part but have run into another problem, when the page changes after the button is clicked I need to update the html.document to the new page as when I use it in the code it pulls up the old source code.
I can get it to work but It only works with a message box activating to say what the browser name is.
Any suggestions?
Dim HTMLDoc As HTMLDocument
Dim MyBrowser As InternetExplorer
Sub GetVehicleDetails2()
Dim MyHTML_Element As IHTMLElement
Dim HTMLDoc As HTMLDocument, Doc As HTMLDocument
Dim MyURL As String, Vehicle As String
Dim x As Integer, y As Integer
On Error GoTo Err_Clear
MyURL = "http://www.1stchoice.co.uk/find-a-part"
x = 0
'open new explorer
Set MyBrowser = New InternetExplorer
MyBrowser.Silent = True
'navigate to page
MyBrowser.navigate MyURL
MyBrowser.Visible = True
'wait until ready
Do While MyBrowser.Busy Or _
MyBrowser.readyState <> 4
DoEvents
Loop
Do
Loop Until MyBrowser.readyState = READYSTATE_COMPLETE
Set HTMLDoc = MyBrowser.document
'enter registration in text box
HTMLDoc.all.license_plate.Value = "LV11VYT"
'click continue button
Set MyHTML_Element = HTMLDoc.getElementsByTagName("button")(1)
MyHTML_Element.Click
Set HTMLDoc = Nothing
'wait until page updated
Set Doc = MyBrowser.document
'Application.Wait (Now() + "00:00:05")
'does not work if you take this out
MsgBox MyBrowser.FullName
'find text returned with vehicle details
For Each MyHTML_Element In Doc.getElementsByTagName("form")
If MyHTML_Element.ID = "step3" Then
Vehicle = MyHTML_Element.innerText
MsgBox Vehicle
End If
Next
'close browser down
'MyBrowser.Quit
Err_Clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
End Sub
using 2003 or 2007, tried web queries, cant pass value & use continue button.
Without trying to start an argument over extracting an element from HTML using Regex (vs. a parser) but Regex would be an easy way to extract the element you need, as it is well-defined and you only need that element.
You could do something like (and I provide an alternative way just using InStr, that works for your example but if there are lots of results returned at once or syntax changes etc then Regex would be more flexible):
Sub blah()
Dim testStr As String
'test string you provided in the Question -> substitute it for your HTML return
testStr = ActiveSheet.Cells(1, 1).Value
'Method 1: Use a simple Instr (fine for the example you provided, but if different bits you need to search are more complicated then you may need to use Regex instead
Dim startLocation As Long, endLocation As Long
Dim extractedText As String
startLocation = InStr(1, testStr, "<h2>", vbTextCompare)
If Not startLocation > 0 Then
Exit Sub 'or move to next or whatever
Else
endLocation = InStr(startLocation, testStr, "</h2>", vbTextCompare)
extractedText = Mid(testStr, startLocation + 4, endLocation - startLocation - 4)
Debug.Print "Basic InStr method: "; extractedText
End If
'Method 2: Use Regex
'more flexible -> reference a Regex engine.
'This example uses Microsoft VBScript Regular Expressions 5.5
'That engine uses the same syntax as MS JavaScript regex
'See http://msdn.microsoft.com/en-us/library/1400241x.aspx for syntax
Dim regex As RegExp
Dim match As match
Set regex = New RegExp
With regex
.Pattern = "(?:<h2>)([\s\S]*?)(?=</h2>)"
'NB this regex engine does not support lookbehinds :-(
'so we have to extract the submatched group for what we want
'(vs. just using Match.Value)
.IgnoreCase = True
.MultiLine = True
For Each match In .Execute(testStr)
Debug.Print "Regex match: "; match.SubMatches.Item(0)
Next match
End With
End Sub
Output is:
Basic InStr method: RENAULT MEGANE (X95) DYNAMIQUE TOMTOM DCI ECO 3 DOOR COUPE 1461cc (2011) DIESEL
Regex match: RENAULT MEGANE (X95) DYNAMIQUE TOMTOM DCI ECO 3 DOOR COUPE 1461cc (2011) DIESEL