Stuck web scraping with VBA

I'm trying to automate a web scraper using VBA to gather price data on certain items. I'm quite new to VBA and have been basing my code on answers to similar questions here, but I'm stuck on a "type mismatch" error. This code opens IE and works fine:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "https://grocery.walmart.com/"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
However, I now want to extract the prices, i.e. the $1.67 for Colgate and the $2.78 for Nature Valley, from the HTML below:
<span data-automation-id="items">
<div class="CartItem__itemContainer___3vA-E" tabindex="-1" data-automation-id="cartItem">
<div class="CartItem__itemInfo___3rgQd">
<span class="TileImage__tileImage___35CNo">
<div class="TileImage__imageContainer___tlQZb">
<img alt="1 of C, o" src="https://i5.walmartimages.com/asr/36829cef-43f2-4d21-9d5e-10aa9def01dd_7.04089903cc0038b3dac3c204ef7e417e.png?odnHeight=150&odnWidth=150&odnBg=ffffff" class="TileImage__image___3MrIo" data-automation-id="image" aria-hidden="true">
</div><span data-automation-id="quantity" class="TileImage__quantity___1rgG4 hidden__audiblyHidden___RoAkK" role="button" aria-label="1 of C, select to change quantities">
1</span></span><div class="CartItem__name___2RJs5">
<div data-automation-id="name" tabindex="0" role="button" aria-label="C button, Select to change quantities">
Colgate Cavity Protection Fluoride Toothpaste - 6 oz</div><span data-automation-id="list-price" class="ListPrice__listPrice___1x8TM" aria-label="1 dollar and 67 cents each">
$1.67 each</span><a class="CartItem__detailsLink___2ts9b" aria-label="Colgate Cavity Protection Fluoride Toothpaste - 6 oz" tabindex="0" href="/ip/Colgate-Cavity-Protection-Fluoride-Toothpaste---6-oz/49714957">
View details</a></div><span class="Price__groceryPriceContainer___19Jim CartItem__price___2ADX6" data-automation-id="price" aria-label="1 dollar and 67 cents ">
<sup class="Price__currencySymbol___3Ye7d">
$</sup><span class="Price__wholeUnits___lFhG5" data-automation-id="wholeUnits">
1</span><sup class="Price__partialUnits___1VX5w" data-automation-id="partialUnits">
67</sup></span></div><div></div></div><div class="CartItem__itemContainer___3vA-E" tabindex="-1" data-automation-id="cartItem">
<div class="CartItem__itemInfo___3rgQd">
<span class="TileImage__tileImage___35CNo">
<div class="TileImage__imageContainer___tlQZb">
<img alt="1 of N, a" src="https://i5.walmartimages.com/asr/775482d5-a136-4ca3-9353-28646ec999c3_1.d861ce7abd9797cbafec2cd2a4b24874.jpeg?odnHeight=150&odnWidth=150&odnBg=ffffff" class="TileImage__image___3MrIo" data-automation-id="image" aria-hidden="true">
</div><span data-automation-id="quantity" class="TileImage__quantity___1rgG4 hidden__audiblyHidden___RoAkK" role="button" aria-label="1 of N, select to change quantities">
1</span></span><div class="CartItem__name___2RJs5">
<div data-automation-id="name" tabindex="0" role="button" aria-label="N button, Select to change quantities">
Nature Valley Granola Bars Sweet and Salty Nut Cashew 6 Bars - 1.2 oz</div><span data-automation-id="list-price" class="ListPrice__listPrice___1x8TM" aria-label="2 dollars and 78 cents each">
$2.78 each</span><a class="CartItem__detailsLink___2ts9b" aria-label="Nature Valley Granola Bars Sweet and Salty Nut Cashew 6 Bars - 1.2 oz" tabindex="0" href="/ip/Nature-Valley-Granola-Bars-Sweet-and-Salty-Nut-Cashew-6-Bars---1.2-oz/10311347">
View details</a></div><span class="Price__groceryPriceContainer___19Jim CartItem__price___2ADX6" data-automation-id="price" aria-label="2 dollars and 78 cents ">
<sup class="Price__currencySymbol___3Ye7d">
$</sup><span class="Price__wholeUnits___lFhG5" data-automation-id="wholeUnits">
2</span><sup class="Price__partialUnits___1VX5w" data-automation-id="partialUnits">
78</sup></span></div><div></div></div>
My instinct (as a true beginner) is to find the div class part above and then search for the aria-label and copy the text following it, but I feel like it will be really long-winded and may end up with tonnes of errors if that div class term is repeated elsewhere on the page.
Any help on how I should proceed (and if that is a good idea or not) would be really helpful. Thanks!

All the prices can be selected using a CSS selector targeting the class attribute:
[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']
You would apply the CSS selector via the querySelectorAll method of document which will return a nodeList.
You could alternatively get a collection using:
.document.getElementsByClassName("Price__groceryPriceContainer___19Jim CartItem__price___2ADX6")
Code outline:
Option Explicit
Public Sub TEST()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.navigate "https://grocery.walmart.com/" '> Travel to homepage
.Visible = True '< Show browser window
Do While .Busy = True Or .readyState <> 4: DoEvents: Loop '< Wait for page to have loaded
Dim priceList As Object, nameList As Object, i As Long, ws As Worksheet, lastRow As Long
Set ws = ActiveSheet
'Code to get your basket ready
lastRow = GetLastRow(ws, 1)
Set priceList = .document.querySelectorAll("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']") 'Select elements by their class attribute (match on basket item prices)
Set nameList = .document.querySelectorAll("[data-automation-id='name']") 'Select elements by their data-automation-id attribute (match on basket item names)
For i = 0 To priceList.Length - 1 '< Loop the nodeList of matched elements
With ws
.Cells(lastRow + i + 1, 1) = nameList.item(i).innerText '<access the name of each matched element
.Cells(lastRow + i + 1, 2) = Now
.Cells(lastRow + i + 1, 3) = priceList.item(i).innerText '<access the price of each matched element
End With
Next i
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.count, columnNumber).End(xlUp).Row
End With
End Function
Fixed basket items:
Toothpaste:
If the cart items remain fixed and the prices update in the basket over time you can track changes in toothpaste price, for example, if you use a CSS selector of:
.CartItem__name___2RJs5 + span
So:
Debug.Print .document.querySelector(".CartItem__name___2RJs5 + span").innerText
Or:
Debug.Print .document.querySelectorAll("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']").item(0).innerText
The last one uses the class attribute to return a nodeList of all matched elements (your basket of items) and accesses the first item (toothpaste) by index 0.
Or you can use the .querySelector method, which returns only the first match, i.e. index 0:
Debug.Print .document.querySelector("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']").innerText
My code targets the elements with a CSS selector (the same pattern syntax used for page styling) that matches on their class attribute. All your basket item prices have the class attribute Price__groceryPriceContainer___19Jim CartItem__price___2ADX6, so the code pulls back a nodeList (a bit like an array) of the elements with this class attribute. The loop runs over the length of the nodeList to access each element by index (starting from 0). The .innerText property returns the text content of the element, i.e. the price.
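One thing to watch in the question's markup: the price container holds the dollar sign, the whole units and the cents in separate elements with no decimal point between them, so its innerText may come back as something like $167 rather than $1.67. A minimal sketch of one way around that, assuming each price container keeps the wholeUnits and partialUnits children shown in the question (PriceFromContainer and DigitsOnly are hypothetical helper names, not part of the code above):

```vba
Option Explicit

' Hypothetical helper: rebuilds a numeric price from one price container node.
' Assumes the container has the wholeUnits / partialUnits children seen in the question's HTML.
Public Function PriceFromContainer(ByVal priceNode As Object) As Double
    Dim wholePart As Object, centsPart As Object
    Set wholePart = priceNode.querySelector("[data-automation-id='wholeUnits']")
    Set centsPart = priceNode.querySelector("[data-automation-id='partialUnits']")
    If wholePart Is Nothing Or centsPart Is Nothing Then
        PriceFromContainer = -1 ' Sentinel: expected children not found
        Exit Function
    End If
    ' Val only recognises "." as the decimal separator, whatever the locale
    PriceFromContainer = Val(DigitsOnly(wholePart.innerText) & "." & DigitsOnly(centsPart.innerText))
End Function

' Hypothetical helper: keeps only the characters 0-9 from a string
Private Function DigitsOnly(ByVal s As String) As String
    Dim i As Long, ch As String
    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        If ch >= "0" And ch <= "9" Then DigitsOnly = DigitsOnly & ch
    Next i
End Function
```

Inside the loop this could be used as .Cells(lastRow + i + 1, 3) = PriceFromContainer(priceList.item(i)) in place of the innerText assignment.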

Related

Does IE11 still support dispatchEvent/FireElement?

Where I work we have an internal page that used to work with VBA, but since the last page or IE update it no longer does.
The macro selects a country and then presses the button to generate a report for that country. The problem is that although the name in the drop-down menu changes visually, the actual value does not change (the onChange event is not being triggered). I have tried moving the focus to another drop-down, firing/dispatching events, and .SendKeys "~", but to no avail.
This is what I have:
Dim document As HTMLDocument: Set document = ieTab.document
Dim processCnt As HTMLInputElement
Dim eventObj As Object
Dim hwndCounter As Integer
For i = 1 To 8
Set processCnt = ElementTimer(3, ieTab)
Select Case i
Case 1
processCnt.DefaultValue = "London"
Case 2
processCnt.Value = "Prague"
Case 3
processCnt.Value = "WUIB"
Case 4
processCnt.Value = "Zurich"
Case 5
processCnt.DefaultValue = "Norway"
Case 6
processCnt.Value = "Sweden"
Case 7
processCnt.Value = "Poland"
Case 8
processCnt.Value = "Russia"
End Select
Set eventObj = document.createEvent("HTMLEvents")
eventObj.initEvent "keyup", False, True
processCnt.dispatchEvent eventObj
Next i
Private Function ElementTimer(number As Integer, ieTab As InternetExplorer)
Dim messageBoxHW As LongPtr
On Error GoTo timer1
1:
Select Case number
Case 1
Set ElementTimer = ieTab.document.getElementsByClassName("rgNoRecords").Item(0)
Case 2
ieTab.document.getElementById("gvReport_ctl00_ctl02_ctl00_ExportToExcelButton").Click
Case 3
Set ElementTimer = ieTab.document.getElementById("ProcessCenter_ID_Input")
If ElementTimer Is Nothing Then GoTo timer1
End Select
On Error GoTo 0
Exit Function
timer1:
DoEvents
messageBoxHW = FindWindow(vbNullString, "Message from webpage")
If messageBoxHW > 0 Then SendMessage messageBoxHW, WM_CLOSE, 0, 0
Resume 1
End Function
The value is being changed before this code.
This is the drop-down I'm having problems with.
<!-- 2020.3.1021.45 --><table summary="combobox" border="0" style="border-width:0;border-collapse:collapse;width:100%" class="rcbFocused rcbExpanded">
<tbody><tr class="rcbReadOnly">
<td class="rcbInputCell rcbInputCellLeft" style="width:100%;"><input name="ProcessCenter_ID" type="text" class="rcbInput radPreventDecorate" id="ProcessCenter_ID_Input" value="United States" readonly="readonly" autocomplete="off"></td><td class="rcbArrowCell rcbArrowCellRight"><a id="ProcessCenter_ID_Arrow" style="overflow: hidden;display: block;position: relative;outline: none;">select</a></td>
</tr>
</tbody></table><input id="ProcessCenter_ID_ClientState" name="ProcessCenter_ID_ClientState" type="hidden" autocomplete="off">
</div>
Does this mean that you can no longer simulate change events? Because it sure looks like it after 2 days of googling and trying. I'm out of ideas.
"Copy JS Path" yielded this: document.querySelector("#ProcessCenter_ID")
To be honest I don't know how to identify the exact JS that is being initiated upon changing the value of a drop-down
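For comparison, the snippet in the question dispatches a non-bubbling "keyup" event rather than a "change" event. A minimal sketch of dispatching a bubbling "change" event instead, assuming the page runs in an IE document mode where createEvent and dispatchEvent are available (FireChange is a hypothetical helper name). Whether the page's own script reacts is not guaranteed: the markup looks like a scripted combo box with a read-only input and a hidden ClientState field, so its handlers may be attached elsewhere.

```vba
' Hypothetical helper: fires a bubbling "change" event on an element.
' Assumes a document mode where createEvent/dispatchEvent exist (IE9+ standards mode).
Public Sub FireChange(ByVal doc As Object, ByVal target As Object)
    Dim evt As Object
    Set evt = doc.createEvent("HTMLEvents")
    evt.initEvent "change", True, False ' arguments: type, bubbles, cancelable
    target.dispatchEvent evt
End Sub
```

With the variables from the question, this would be called as FireChange ieTab.document, processCnt after the value has been set.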

Set combo box by name

When using VBA for automatic handling of Internet Explorer, you can select an item in a combo box, by its value. If our combo box HTML looks like this:
<select name="my_combo_box" id="fruits">
<option value="1">Apple</option>
<option value="2">Banana</option>
<option value="3">Strawberry</option>
</select>
Then you can select the option "Banana" using VBA, like this:
getElementById("fruits").value = 2
But is there any way to select it using its name (display member), i.e. Banana?
"Banana" is not an HTML attribute, but the text between the tags.
You could loop through all elements with the option tag and pick the one with the desired innerText. In the next step, you could cut the id out of its outerHTML (innerHTML holds only the text between the tags, so the value attribute is not in it). Like:
Sub MyMacro()
Dim opt As IHTMLElement
Dim iComboBox As IHTMLElement ' Must be set to the <select> element before the loop
Dim sID As String
For Each opt In iComboBox.getElementsByTagName("option")
If InStr(opt.innerText, "Banana") > 0 Then
sID = CutId(opt.outerHTML)
End If
Next opt
End Sub
Function CutId(s As String) As String
' Takes the single character two positions after the first "=", assuming markup like value="2"
CutId = Mid(s, InStr(s, "=") + 2, 1)
End Function
Then you can use sID to select the item. I guess that your HTML is just an example, so you may need to adjust the CutId function; this is just my proposed solution. In particular, you will need to adjust it if you expect the id to have two digits.
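An alternative that avoids parsing markup altogether is to compare each option's display text and set the selected index directly. A minimal sketch, assuming combo is the select element (for example, the one returned by getElementById("fruits")); SelectOptionByText is a hypothetical helper name:

```vba
' Hypothetical helper: selects the first option whose display text equals caption.
' Returns True if a matching option was found and selected.
Public Function SelectOptionByText(ByVal combo As Object, ByVal caption As String) As Boolean
    Dim i As Long
    For i = 0 To combo.Options.Length - 1
        If Trim$(combo.Options.Item(i).Text) = caption Then
            combo.selectedIndex = i ' Select by position rather than by value
            SelectOptionByText = True
            Exit Function
        End If
    Next i
End Function
```

For the example HTML, SelectOptionByText(IE.document.getElementById("fruits"), "Banana") would select the second option, where IE stands for whatever InternetExplorer object you are automating. Setting the index from code does not by itself raise the page's change event.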

VBA/DOM - Get elements based on attribute

Excel 2013 on Windows 7. XPath/Javascript/jQuery is out of scope.
I am trying to iterate over select div elements in a page, namely elements that have a specific data-level attribute.
My current approach is similar to this, but I was unable to find a non-manual way to select elements based on attributes. The closest I came was something like:
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", url, False
.Send
pHTML.body.innerHTML = .ResponseText
End With
Set eCollection = pHTML.getElementsByClassName("chapter")(0).getElementsByTagName("div") ' Take the first "chapter" element, then its divs
For i = 0 To eCollection.Length - 1
If eCollection(i).getAttribute("data-level") >= 0 Then
' Throw cake
End If
Next i
This solution, while I am sure it is viable (if inelegant), seems sub-optimal if only for how big the loop is going to end up being when I start looking for specific elements and sequences of elements within these elements.
So I am looking for a way to do something like this:
For Each pElement In pHTML.getElementsByClassName("chapter").getElementsByTagName("div").getElementsByAttribute("data-level")
' Throw cake at the element
Next
I'm aware that there is no method getElementsByAttribute, hence the question.
Is there some approach here that I am blind to, or am I locked to manual iteration?
Alternatively, if I swap my current approach for creating an IE instance, à la this answer, could I conceivably use querySelectorAll to end up with something resembling the result I have outlined above?
For anyone else coming this way, the outer shell, so to speak, can look like this:
Sub ScrapeWithHTMLObj(url As String, domClassName As String, domTag As String, domAttribute As String, domAttributeValue As String)
' Dependencies:
' * Microsoft HTML Object Library
' Declare vars
Dim pHTML As HTMLDocument
Dim pElements As Object, pElement As Object
Set pHTML = New HTMLDocument
' Basic URL healthcheck
Do While (url = "" Or (Left(url, 7) <> "http://" And Left(url, 8) <> "https://"))
MsgBox ("Invalid URL!")
url = InputBox("Enter new URL: (0 to terminate)")
If url = "0" Then Exit Sub
Loop
' Fetch page at URL
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", url, False
.Send
pHTML.body.innerHTML = .ResponseText
End With
' Declare page elements
Set pElements = pHTML.getElementsByClassName(domClassName)
Set pElement = pElements(0).getElementsByTagName(domTag)
' Extract only elements with wanted attribute
pEleArray = getElementsByAttribute(pElement, domAttribute, domAttributeValue)
For Each e In pEleArray
' Do stuff to elements
Debug.Print e.getAttribute(domAttribute)
Next
End Sub
If you go this route, you'll also need something like this:
Function getElementsByAttribute(pObj As Object, domAttribute As String, domAttributeValue As String) As Object()
Dim oTemp() As Object
Dim i As Long, found As Long
For i = 0 To pObj.Length - 1
'Debug.Print pObj(i).getAttribute(domAttribute)
If pObj(i).getAttribute(domAttribute) = domAttributeValue Then
found = found + 1
ReDim Preserve oTemp(1 To found) ' Grow the array by one slot per match
Set oTemp(found) = pObj(i)
End If
Next i
getElementsByAttribute = oTemp ' Left unallocated when nothing matched
End Function
Depending on the HTML tree, you'll need to change which elements you zero in on in the sub, obviously. For the site I used in testing, this structure worked flawlessly.
Example usage:
Call ScrapeWithHTMLObj("https://somesite", "chapter-index", "div", "data-level", "1")
It will enter the first element with the class chapter-index, select all of its descendants with the div tag, and finally extract those whose data-level attribute has the value 1.
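On the querySelectorAll idea raised in the question: a CSS attribute selector can do the class, tag and attribute filtering in one call. A minimal sketch, assuming pHTML is an HTMLDocument already loaded as in ScrapeWithHTMLObj and that querySelectorAll is available on it (PrintByAttribute is a hypothetical helper name):

```vba
' Hypothetical variant of the loop above using a CSS attribute selector.
' Assumes pHTML is an already-populated HTMLDocument that supports querySelectorAll.
Sub PrintByAttribute(ByVal pHTML As Object, ByVal cssSelector As String, ByVal domAttribute As String)
    Dim nodes As Object, i As Long
    Set nodes = pHTML.querySelectorAll(cssSelector)
    For i = 0 To nodes.Length - 1 ' The nodeList is zero-based
        Debug.Print nodes.Item(i).getAttribute(domAttribute)
    Next i
End Sub
```

A call such as PrintByAttribute pHTML, ".chapter-index div[data-level='1']", "data-level" corresponds to the example usage, with one difference: the selector matches divs under every element with the chapter-index class, not only the first one.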

VBA and Aria HTML Control

I have a seemingly simple but aggravating question. I am writing VBA code to control a slider bar on a webpage. This should be a straightforward process, like filling in various types of check boxes and input boxes, but I am running into a coding error. The slider bar has a high and a low value that I would like to set (denoted by Price).
Here is my code...
Function Master
Dim Elem As Object ' Object used in For loop.
Dim e As Object ' Object used in For loop.
Dim Element As Object ' Object used to click buttons for search.
Dim PriceLow As Integer
Dim PriceHigh As Integer
Set objShell = CreateObject("Shell.Application")
Set objWindow = objShell.Windows()
Set Elem = objIEApp.Document.getElementsByTagName("div")
For Each e In Elem
If e.role = "slider" And e.tabindex = 1 Then
e.aria-valuenow = Pricelow
ElseIf e.role = "slider" And e.tabindex = 2 Then
e.aria-valuenow = Pricehigh
End If
Next e
End Function
This is the HTML code for the low Price (without the <> brackets)
div
tabindex="1" class="airslide-handle" role="slider" aria-valuenow="10"
aria-valuemin="10" aria-valuemax="1000" style="left: 0%;"
data-reactid=".4.1.1:$0" abp="340" data-handle-key="0">
/div
And for the high Price (same with <>)
div tabindex="2" class="airslide-handle" role="slider" aria-valuenow="1000"
aria-valuemin="10" aria-valuemax="1000" style="left: 100%;"
data-reactid=".4.1.1:$1" abp="341" data-handle-key="1">
/div
When I type in the code lines "e.aria-valuenow = Pricehigh" and "e.aria-valuenow = Pricelow" though, VBA auto edits to make my code "e. Aria - valuenow = Pricehigh" and "e.Aria - valuenow = Pricelow" which obviously causes a compiling error in VBA. How do I fix or get around this?
Just write "ariaValuenow" without the "-". So:
If e.role = "slider" And e.tabindex = 1 Then
e.ariaValuenow = Pricelow
ElseIf e.role = "slider" And e.tabindex = 2 Then
e.ariaValuenow = Pricehigh
End If
Cheers
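Another way to sidestep the hyphen is to pass the attribute name as a string to setAttribute. A minimal sketch, assuming doc is the page's document object and the slider handles carry the role and tabindex attributes shown in the question (SetSliderValue is a hypothetical helper name). Changing the attribute alone may not move the handle if the page's script keeps its own state.

```vba
' Hypothetical helper: sets aria-valuenow on every slider handle with the given tabindex.
Public Sub SetSliderValue(ByVal doc As Object, ByVal tabIndexValue As Long, ByVal newValue As Long)
    Dim handles As Object, i As Long
    ' Match div elements on both role and tabindex attributes
    Set handles = doc.querySelectorAll("div[role='slider'][tabindex='" & tabIndexValue & "']")
    For i = 0 To handles.Length - 1
        handles.Item(i).setAttribute "aria-valuenow", CStr(newValue)
    Next i
End Sub
```

With the question's variables this would be SetSliderValue objIEApp.Document, 1, PriceLow for the low handle and SetSliderValue objIEApp.Document, 2, PriceHigh for the high one.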

Automate click(expanding) plus signs within internet explorer

I am working on automating some data entry into an intranet web page. I have had success with this type of code in the past to click checkboxes, but have been unable to make it work on the plus signs that expand the rows. The code below does nothing and raises no error; it just runs its course.
Here is my code:
Set div = IE.document.getElementsByTagName("div")
For Each i In div
'showExpand?
If i.id Like "iconShowA*" Then
If i.onclick = "showExpand(*)" Then
i.Click'Click plus sign
v = Replace(i.id, "iconShowA", "")
col.Add v 'store the numeric part
End If
End If
Next i
For Each v In col
Debug.Print v
Next v
The pertinent HTML lines are:
(What I'm trying to click, there can be a variable number of these with a different numerical identifier "iconShowA(x)")
<div id="iconShowA34" class="ui-icon ui-icon-plusthick" onclick="showExpand('34','34')" ;="" style="display: block;"></div>
(I also need to avoid clicking these)
<div id="iconShowA_N4" class="ui-icon ui-icon-plusthick" onclick="showExpandSub('4','4')" ;=""></div>
The code below achieved the desired results. I was unable to make the getElementsByTagName approach work. This method uses getElementById to find each element. It seemed crucial that the full ID be used, so I used a Do While loop to iterate through the numbers that could appear in the ID naming convention.
n = DateDiff("ww", firstDate, secondDate)'Determines number of plus' to click
v = 0 'Counter for plus click event
x = 6 ' starting value for numerical piece of Id
Do While n > v 'continue loop until all plus' have been clicked
Set div = IE.document.getElementById("iconShowA" & x) 'Id must be defined completely to click
If div Is Nothing Then 'tests if element exists
x = x + 7
Else
div.Click 'clicks an element that exists
v = v + 1
x = x + 7 'iterates id number by 7 as that is convention of website
End If
Loop
CSS selectors:
Assuming all the elements you want to click have showExpand and not showExpandSub in their onclick attribute, you can use a CSS selector to select these elements:
div[onclick^='showExpand(']
This says elements with div tag having attribute onclick with value starting with 'showExpand('.
VBA:
Use the querySelectorAll method of document to return a nodeList of all matching elements. You then loop the .Length to retrieve elements.
Dim aNodeList As Object, iNode As Long
Set aNodeList = ie.document.querySelectorAll("div[onclick^='showExpand(']")
For iNode = 0 To aNodeList.Length - 1
Debug.Print aNodeList.item(iNode).innerText
'Debug.Print aNodeList(iNode).innerText '<== Sometimes this syntax
Next iNode
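The loop above only prints each element's text. To expand the rows, the same nodeList can be clicked instead. A minimal sketch, assuming doc is the IE document object (ClickAllMatches is a hypothetical helper name):

```vba
' Hypothetical helper: clicks every element matched by a CSS selector.
' Returns the number of elements clicked.
Public Function ClickAllMatches(ByVal doc As Object, ByVal cssSelector As String) As Long
    Dim nodes As Object, i As Long
    Set nodes = doc.querySelectorAll(cssSelector)
    For i = 0 To nodes.Length - 1
        nodes.Item(i).Click
        ClickAllMatches = ClickAllMatches + 1
    Next i
End Function
```

For example, Debug.Print ClickAllMatches(ie.document, "div[onclick^='showExpand(']") clicks the plus signs and prints how many were found. Elements whose onclick starts with showExpandSub( are not matched, because the character after showExpand is S rather than an opening parenthesis.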