VBA HTML Scraping - '.innertext' from complex table

All,
I've created the following Module to scrape a single value (1m % change in London house prices) from the below address:
https://www.hometrack.com/uk/insight/uk-cities-house-price-index/
The specific value is nested within the following code:
The below VBA code is my attempt at scraping. I, perhaps wrongly, feel that I am very close to capturing the value - but the code will not work.
Does anyone know where I am going wrong here? It doesn't show an error message but also doesn't output any values.
Sub HousePriceData()
Dim wb As Workbook
Dim ws As Worksheet
Dim TxtRng As Range
Dim ie As Object
Dim V As Variant
Dim myValue As Variant
Set ie = CreateObject("INTERNETEXPLORER.APPLICATION")
ie.NAVIGATE "https://www.hometrack.com/uk/insight/uk-cities-house-price-index/"
ie.Visible = False
While ie.ReadyState <> 4
DoEvents
Wend
Set wb = ActiveWorkbook
Set ws = wb.Sheets("Input")
Set TxtRng = ws.Range("C15")
Set myValue = ie.document.getElementById("cities-index-table").getElementsByTagName("tr")(7).getElementsByTagName("td")(5)
TxtRng = myValue.innerText
End Sub

Try to use XHR and primitive parsing instead of awkward IE:
Sub Test()
Dim strUrl As String
Dim strTmp As String
Dim arrTmp As Variant
strUrl = "https://www.hometrack.com/uk/insight/uk-cities-house-price-index/"
With CreateObject("MSXML2.XMLHttp")
.Open "GET", strUrl, False
.Send ""
strTmp = .ResponseText
End With
arrTmp = Split(strTmp, ">London</a></td>", 2)
strTmp = arrTmp(1)
arrTmp = Split(strTmp, "<td>", 7)
strTmp = arrTmp(6)
arrTmp = Split(strTmp, "</td>", 2)
strTmp = arrTmp(0)
ThisWorkbook.Sheets("Input").Range("C15").Value = strTmp
End Sub
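The Split-based parsing in the answer above can be wrapped in a small reusable function. This is only a sketch, not part of the original answer: the name CellAfterMarker and its parameters are hypothetical, and it assumes, as the answer does, that the cells are written as plain <td> tags without attributes.

```vba
' Hypothetical helper: returns the text of the Nth <td> cell that follows
' a marker string in raw HTML, or "" when the marker or cell is missing.
' Assumes cells are plain "<td>" tags with no attributes.
Public Function CellAfterMarker(ByVal html As String, ByVal marker As String, _
                                ByVal cellNumber As Long) As String
    Dim parts As Variant
    Dim cellText As String
    Dim closePos As Long

    ' Everything after the first occurrence of the marker
    parts = Split(html, marker, 2)
    If UBound(parts) < 1 Then Exit Function        ' marker not found

    ' Cut at each "<td>"; element N starts with the Nth cell after the marker
    parts = Split(parts(1), "<td>", cellNumber + 1)
    If UBound(parts) < cellNumber Then Exit Function ' not enough cells

    ' Keep only the text before the closing tag
    cellText = parts(cellNumber)
    closePos = InStr(1, cellText, "</td>")
    If closePos > 0 Then cellText = Left$(cellText, closePos - 1)

    CellAfterMarker = cellText
End Function
```

With the response text in strTmp, CellAfterMarker(strTmp, ">London</a></td>", 6) should pick the same cell as the three Split calls in the answer, while returning an empty string instead of raising "Subscript out of range" if the page layout changes.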

Try this:
Dim Engmt As String
Engmt = "ERRORHERE"
On Error Resume Next
Engmt = Trim(ie.document.getElementById("cities-index-table").getElementsByTagName("tr")(12).getElementsByTagName("td")(4).innerText)
On Error GoTo 0
If Engmt = "ERRORHERE" Then
TxtRng.Value = "ERROR"
Else
TxtRng.Value = Engmt
End If

Related

Data Scraping from Website - receiving error on different player pages

I'm trying to scrape game logs from basketball-reference.com. It worked perfectly for the first two players I chose (DeMar DeRozan and LaMarcus Aldridge), but when I went through other players it failed to scrape the data for many of them (for example, Kevin Durant).
I have no idea why. Stephen Curry worked fine, but for players like Draymond Green and Kevin Durant the code does not scrape the data at all: everything after the Date column stops working.
Sub Data()
Dim ieObj As InternetExplorer
Dim htmlEle As IHTMLElement
Dim i As Integer
i = 1
Set ieObj = New InternetExplorer
ieObj.Visible = True
ieObj.navigate "https://www.basketball-reference.com/players/d/duranke01/gamelog/2019"
Application.ScreenUpdating = False
Application.DisplayStatusBar = True
Application.Wait Now + TimeValue("00:00:05")
For Each htmlEle In ieObj.document.getElementsByClassName("stats_table")(0).getElementsByTagName("tr")
With ActiveSheet
.Range("A" & i).Value = htmlEle.Children(0).textContent
.Range("B" & i).Value = htmlEle.Children(1).textContent
.Range("C" & i).Value = htmlEle.Children(2).textContent
.Range("D" & i).Value = htmlEle.Children(3).textContent
End With
i = i + 1
On Error Resume Next
Next htmlEle
End Sub
The error always happens on this line:
.Range("D" & i).Value = htmlEle.Children(3).textContent
I tried skipping columns too, but it still wouldn't work.
I found no problem when selecting the table by its id (which is visible in .responseText) and using XMLHTTP instead of a browser.
I only tested with 3 URLs, placed in Sheet1 A1:A3:
https://www.basketball-reference.com/players/d/duranke01/gamelog/2019
https://www.basketball-reference.com/players/c/curryst01/gamelog/2019
https://www.basketball-reference.com/players/g/greendr01/gamelog/2019
On this site you can sometimes encounter tables inside comments, so I stripped out the comment tags before processing. This was not necessary for the links I tried.
I use the clipboard to copy and paste the table, but you could just use
Set hTable = html.getElementById("pgl_basic")
Then loop the tr and td as you wish using getElementsByTagName.
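As a rough sketch of that loop (the name WriteTableToSheet and its parameters are hypothetical, not from the answer), the routine below walks every row and writes whatever cells the row actually contains. Iterating over each row's Children, rather than indexing Children(0) to Children(3), avoids failing on rows that hold fewer cells; whether such rows are what triggers the error in the question is an assumption.

```vba
' Hypothetical helper: copies an HTML table element to a worksheet,
' one worksheet row per <tr> and one column per <th>/<td> child.
Public Sub WriteTableToSheet(ByVal hTable As Object, ByVal ws As Worksheet)
    Dim tr As Object, cell As Object
    Dim r As Long, c As Long

    For Each tr In hTable.getElementsByTagName("tr")
        r = r + 1
        c = 0
        ' Children holds only the cells this row really has,
        ' so short rows (e.g. a single spanning cell) are not a problem
        For Each cell In tr.Children
            c = c + 1
            ws.Cells(r, c).Value = cell.innerText
        Next cell
    Next tr
End Sub
```

It could be called as WriteTableToSheet html.getElementById("pgl_basic"), wsCurrent after checking that the table was found.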
References (VBE>Tools>References):
Microsoft HTML Object Library
Option Explicit
Public Sub GetPlayerInfo()
Dim urls(), i As Long, html As HTMLDocument, hTable As Object
Dim ws As Worksheet, wsCurrent As Object, clipboard As Object
Dim lastRow As Long, playerIdentifier As String, arr() As String
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
Set ws = ThisWorkbook.Worksheets("Sheet1")
lastRow = ws.Cells(ws.rows.Count, "A").End(xlUp).Row
urls = Application.Transpose(ws.Range("A1:A" & lastRow).Value)
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(urls) To UBound(urls)
.Open "GET", urls(i), False
.send
html.body.innerHTML = Replace$(Replace$(.responseText, "-->", vbNullString), "<!--", vbNullString) 'remove comments
arr = Split(urls(i), "/")
playerIdentifier = arr(5)
If SheetExists(playerIdentifier) Then
With ThisWorkbook.Worksheets(playerIdentifier).Cells
.ClearContents
.ClearFormats
Set wsCurrent = ThisWorkbook.Worksheets(playerIdentifier)
End With
Else
Set wsCurrent = ThisWorkbook.Worksheets.Add
wsCurrent.name = playerIdentifier
End If
Set hTable = html.querySelector("#pgl_basic")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
wsCurrent.Range("A1").PasteSpecial
Set wsCurrent = Nothing: Set hTable = Nothing: Erase arr: playerIdentifier = vbNullString
Application.CutCopyMode = False
Next
End With
End Sub
Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by #Rory
SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function
IE
Note that a proper page-load wait is used.
Option Explicit
Public Sub GetPlayerInfo()
Dim ieObj As InternetExplorer, htmlEle As IHTMLElement
Dim urls(), i As Long, j As Long, hTable As Object
Dim ws As Worksheet, wsCurrent As Object
Dim lastRow As Long, playerIdentifier As String, arr() As String
Application.ScreenUpdating = False
On Error GoTo errHand
Set ws = ThisWorkbook.Worksheets("Sheet1")
lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
urls = Application.Transpose(ws.Range("A1:A" & lastRow).Value)
Set ieObj = New InternetExplorer
With ieObj
.Visible = True
For j = LBound(urls) To UBound(urls)
.navigate urls(j)
While .Busy Or .readyState <> 4: DoEvents: Wend
arr = Split(urls(j), "/")
playerIdentifier = arr(5)
If SheetExists(playerIdentifier) Then
With ThisWorkbook.Worksheets(playerIdentifier).Cells
.ClearContents
.ClearFormats
Set wsCurrent = ThisWorkbook.Worksheets(playerIdentifier)
End With
Else
Set wsCurrent = ThisWorkbook.Worksheets.Add
wsCurrent.Name = playerIdentifier
End If
i = 1
Set hTable = .document.getElementById("pgl_basic")
If Not hTable Is Nothing Then
For Each htmlEle In hTable.getElementsByTagName("tr")
With wsCurrent
.Range("A" & i).Value = htmlEle.Children(0).textContent
.Range("B" & i).Value = htmlEle.Children(1).textContent
.Range("C" & i).Value = htmlEle.Children(2).textContent
.Range("D" & i).Value = htmlEle.Children(3).textContent
End With
i = i + 1
Next htmlEle
End If
Set wsCurrent = Nothing: Set hTable = Nothing: Erase arr: playerIdentifier = vbNullString
Next
End With
errHand:
Application.ScreenUpdating = True
If Err.Number <> 0 Then Debug.Print Err.Number, Err.Description
If Not ieObj Is Nothing Then ieObj.Quit
End Sub

VBA: Subscript out of range or Type Mismatch

I'm very new to VBA and am really stuck. Below is my code; near the end you'll see my For loop for Des and DesArr. All I'm trying to do with that loop is pull a column of cells from the worksheet "SIC", which is Sheet2 in my workbook. I get either "Subscript out of Range" or "Type Mismatch", and whenever I try to google and correct one, the other error takes its place. If anyone can help me work through this I'd greatly appreciate it!
Public Sub getGoogleDescriptions(strSearch As String)
Dim URL As String, strResponse As String
Dim objHTTP As Object
Dim htmlDoc As HTMLDocument
Dim result As String
Dim i As Integer
Dim u As Integer
Dim resultArr As Variant
Dim Des As String
Dim DesArr(2 To 48) As Long
Set htmlDoc = CreateObject("htmlfile")
'Set htmlDoc = New HTMLDocument
Dim objResults As Object
Dim objResult As Object
strSearch = Replace(strSearch, " ", "+")
URL = "https://www.google.com/search?q=" & strSearch
Set objHTTP = CreateObject("MSXML2.XMLHTTP")
With objHTTP
.Open "GET", URL, False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.send
htmlDoc.body.innerHTML = .responseText
End With
Set objResults = htmlDoc.getElementsByClassName("st")
Debug.Print objResults(0).innerText
result = CStr(objResults(0).innerText)
resultArr = Split(result, " ", -1, 0)
For i = LBound(resultArr) To UBound(resultArr) 'Define i to be the length of the List'
Debug.Print i, resultArr(i) 'Prints the corresponding index value and array element'
Next i 'repeat
Set htmlDoc = Nothing
Set objResults = Nothing
Set objHTTP = Nothing
Set wk = ActiveWorkbook
For u = 2 To 48
Des = Sheets("SIC").Range("C" & u).Value
DesArr(u) = Des
Next u
Debug.Print DesArr(2)
End Sub
You're getting a type mismatch because DesArr is declared As Long, so each element must be a whole number between -2,147,483,648 and 2,147,483,647, but the loop assigns it a String (Des).
Within the subroutine it is used to hold worksheet text, so two corrections are needed. First, declare it as a Variant array, as in the Dim DesArr line of the code below.
Second, adjust the loop index: the loop runs from 2 to 48, which is a simple offset of 2, so use (u - 2) as the array index. The array then holds 47 elements, indexed from 0 to 46.
Public Sub getGoogleDescriptions(strSearch As String)
Dim URL As String, strResponse As String
Dim objHTTP As Object
Dim htmlDoc As HTMLDocument
Dim result As String
Dim i As Integer
Dim u As Integer
Dim resultArr As Variant
Dim Des As String
Dim DesArr(0 To 46) As Variant '47 elements, one per row C2:C48
Set htmlDoc = CreateObject("htmlfile")
'Set htmlDoc = New HTMLDocument
Dim objResults As Object
Dim objResult As Object
strSearch = Replace(strSearch, " ", "+")
URL = "https://www.google.com/search?q=" & strSearch
Set objHTTP = CreateObject("MSXML2.XMLHTTP")
With objHTTP
.Open "GET", URL, False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.send
htmlDoc.body.innerHTML = .responseText
End With
Set objResults = htmlDoc.getElementsByClassName("st")
Debug.Print objResults(0).innerText
result = CStr(objResults(0).innerText)
resultArr = Split(result, " ", -1, 0)
For i = LBound(resultArr) To UBound(resultArr) 'Define i to be the length of the List'
Debug.Print i, resultArr(i) 'Prints the corresponding index value and array element'
Next i 'repeat
Set htmlDoc = Nothing
Set objResults = Nothing
Set objHTTP = Nothing
Set wk = ActiveWorkbook
For u = 2 To 48
Des = Sheets("SIC").Range("C" & u).Value
DesArr(u - 2) = Des
Next u
Debug.Print DesArr(0)
End Sub
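As an alternative to filling the array cell by cell (this is a different technique from the answer's, shown only as a sketch), a multi-cell range can be read into a Variant in one assignment. The result is a two-dimensional, 1-based array, so the first row of C2:C48 lands in element (1, 1). The procedure name ReadDescriptions is hypothetical; the sheet name and range are taken from the question.

```vba
' Hypothetical example: read C2:C48 of sheet "SIC" into a Variant array
' in a single step instead of looping over the cells.
Public Sub ReadDescriptions()
    Dim values As Variant
    Dim u As Long

    ' Assigning a multi-cell range's Value yields a 2-D, 1-based array:
    ' values(1, 1) is C2, values(47, 1) is C48
    values = ThisWorkbook.Worksheets("SIC").Range("C2:C48").Value

    For u = LBound(values, 1) To UBound(values, 1)
        Debug.Print u, values(u, 1)
    Next u
End Sub
```

Because the array is a Variant, it accepts text, numbers, and empty cells alike, which sidesteps the type mismatch described above.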

Unable to convert my existing scraper to xmlhttp request

I have written a script using Internet Explorer to get the title of any given website, and it works flawlessly. Performance is a big concern, though, so I would like to do the same with an XMLHTTP request; I've tried a lot but can't make it work. What I have so far is:
Sub Title_scraping()
Dim IE As Object
Dim doc As Object, cel As Range
For Each cel In Range("A1:A5")
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate cel.Value
While IE.Busy
DoEvents
Wend
Set doc = IE.document
x = x + 1
Cells(x, 2) = doc.title
Next cel
End Sub
Sites I've tried with and got results:
https://stackoverflow.com/documentation/
https://codereview.stackexchange.com/
https://yts.ag/browse-movies
Combining your code and the code posted here, here is your final code:
Sub GetData()
Dim title As String
Dim objHttp As Object, cel As Range, x As Long
Set objHttp = CreateObject("MSXML2.ServerXMLHTTP")
For Each cel In Range("A1:A5")
objHttp.Open "GET", cel.Value, False
On Error Resume Next
objHttp.send ""
title = objHttp.responseText
If InStr(1, UCase(title), "<TITLE>") Then
title = Mid(title, InStr(1, UCase(title), "<TITLE>") + Len("<TITLE>"))
title = Mid(title, 1, InStr(1, UCase(title), "</TITLE>") - 1)
Else
title = ""
End If
x = x + 1
Cells(x, 2) = title
Next cel
End Sub
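The title extraction in the code above can also be isolated into a function, which makes it easier to check without any network request. This is a sketch under assumptions, not the answerer's code: the name ExtractTitle is hypothetical, and it additionally tolerates attributes on the opening tag (for example <title id="x">), which the "<TITLE>" search above would miss.

```vba
' Hypothetical helper: returns the text between <title ...> and </title>,
' matched case-insensitively, or "" if no complete title element is found.
Public Function ExtractTitle(ByVal html As String) As String
    Dim openPos As Long, startPos As Long, endPos As Long

    openPos = InStr(1, html, "<title", vbTextCompare)
    If openPos = 0 Then Exit Function

    ' End of the opening tag, skipping any attributes
    startPos = InStr(openPos, html, ">")
    If startPos = 0 Then Exit Function

    endPos = InStr(startPos, html, "</title>", vbTextCompare)
    If endPos = 0 Then Exit Function

    ExtractTitle = Trim$(Mid$(html, startPos + 1, endPos - startPos - 1))
End Function
```

Inside the loop, title = ExtractTitle(objHttp.responseText) would then replace the If/Else block.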

Internet Explorer VBA Automation Error: The object Invoked has disconnected from its clients

I'm trying to write code that will read a value from Excel, look it up in an internal web based system and store the results back in the Excel. It reads the Excel with no problem, opens Internet Explorer with no problem, but when I then try to reference what's been opened, I get the above error. The line "ie.Navigate url" works, but the next line "Set DOC = ie.Document" generates the error. Any ideas on what's causing this? Here's my code:
Public Sub getClient()
Dim xOpen As Boolean
xOpen = False
Dim row As Long
Dim xL As Excel.Application
Set xL = New Excel.Application
xL.Visible = False
Dim wb As Excel.Workbook
Dim sh As Excel.Worksheet
'Change the name as needed, out put in some facility to input it or
'process multiples...
Dim filename As String
filename = "auditLookup.xlsx"
Set wb = xL.Workbooks.Open(getPath("Audit") + filename)
xOpen = True
Set sh = wb.Sheets(1)
Dim ie As Variant
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
Dim DOC As HTMLDocument
Dim idx As Integer
Dim data As String
Dim links As Variant
Dim lnk As Variant
Dim iRow As Long
iRow = 2 'Assume headers
Dim clientName As String
Dim clientID As String
Dim nameFound As Boolean
Dim idFound As Boolean
Dim url As String
While sh.Cells(iRow, 1) <> ""
'Just in case these IDs are ever prefixed with zeroes, I'm inserting
'some random character in front, but removing it of course when
'processing.
url = "https://.../" + mid(sh.Cells(iRow, 1), 2)
ie.navigate url
Set DOC = ie.Document
'Search td until we find "Name:" then the next td will be the name.
'Then search for "P1 ID (ACES):" and the next td with be that.
Set links = DOC.getElementsByTagName("td")
clientName = ""
clientID = ""
nameFound = False
idFound = False
For Each lnk In links
data = lnk.innerText
If nameFound Then
clientName = data
ElseIf idFound Then
clientID = data
End If
If nameFound And idFound Then
Exit For
End If
If data = "Name:" Then
nameFound = True
ElseIf data = "P1 ID (ACES):" Then
idFound = True
End If
Next
sh.Cells(iRow, 2) = clientName
sh.Cells(iRow, 2) = clientID
iRow = iRow + 1
Wend
Set ie = Nothing
If xOpen Then
wb.Save
Set wb = Nothing
xL.Quit
Set xL = Nothing
Set sh = Nothing
xOpen = False
End If
Exit Sub
Changing to:
Dim ie As InternetExplorer
Set ie = New InternetExplorerMedium
...
solved the problem. I also needed to add back the Do loop mentioned in the comments:
Do
DoEvents
Loop Until ie.ReadyState = READYSTATE_COMPLETE
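A wait loop like this runs forever if the page never finishes loading. The sketch below adds a timeout; the function name WaitForPage and the 30-second default are assumptions for illustration, and the literal 4 stands for READYSTATE_COMPLETE so that the code also works with late binding.

```vba
' Hypothetical helper: waits until the browser has finished loading.
' Returns True when the page is ready, False if the timeout elapsed.
Public Function WaitForPage(ByVal ie As Object, _
                            Optional ByVal timeoutSeconds As Double = 30) As Boolean
    Dim started As Single
    Dim elapsed As Double

    started = Timer                       ' seconds since midnight
    Do While ie.Busy Or ie.readyState <> 4    ' 4 = READYSTATE_COMPLETE
        DoEvents
        elapsed = Timer - started
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer wrapped at midnight
        If elapsed > timeoutSeconds Then Exit Function  ' returns False
    Loop

    WaitForPage = True
End Function
```

It would be called right after ie.navigate url, for example If Not WaitForPage(ie) Then Exit Sub, before touching ie.Document.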

Print/Import all web page source data using vba

I have the code below, which imports only part of the source code into the sheet. I want all of the source code, as it is.
Sub GetSourceCode()
Dim ie As Object
Dim str As String
Dim arr
str = Sheets("sheet2").Range("I1").Value
Set ie = CreateObject("INTERNETEXPLORER.APPLICATION")
ie.Navigate "https://tiweb.industrysoftware.automation.com/prdata/cgi-bin/n_prdata_index.cgi?"
ie.Visible = False
Do Until ie.ReadyState = 4
DoEvents
Loop
ie.Document.getelementsbyname("pr_numbers")(0).Value = str
Application.SendKeys ("~")
Do Until ie.ReadyState = 4
DoEvents
Loop
Worksheets("Download_PRdata2").Activate
arr = Split(ie.Document.body.outertext)
Worksheets("Download_PRdata2").Activate
ActiveSheet.Range("A1").Resize(UBound(arr) + 1, 1).Value = Application.Transpose(arr)
End Sub
You can refer to the code below:
' Fetch Entire Source Code
Private Sub HTML_VBA_Excel()
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
'Change the URL before executing the code
sURL = "http://www.google.com"
'Extract data from website to Excel using VBA
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
sPageHTML = oXMLHTTP.responseText
'Get webpage data into Excel
' If the source code is long, save it to an external text file instead,
' since an Excel cell can hold at most 32,767 characters
ThisWorkbook.Sheets(1).Cells(1, 1) = sPageHTML
MsgBox "XMLHTML Fetch Completed"
End Sub
Source : http://www.vbausefulcodes.in/usefulcodes/get-data-or-source-code-from-webpage-using-excel-vba.php
Hope this will be useful to you!
You can save the source code to a text file like this. Replace the line ThisWorkbook.Sheets(1).Cells(1, 1) = sPageHTML with the following call:
Createtextfile (sPageHTML)
and add the function below after End Sub.
Sub Createtextfile(sPageHTML)
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim oFile As Object
strPath = "E:\test.txt"
Set oFile = fso.Createtextfile(strPath)
oFile.WriteLine sPageHTML
oFile.Close
Set fso = Nothing
Set oFile = Nothing
End Sub
Change the location where you want to save.
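One thing to watch for: by default CreateTextFile creates an ASCII file, and writing page source that contains characters outside the system code page can fail. The variant below is a sketch under that assumption; the name SaveTextAsUnicode and both parameters are hypothetical. It passes True for the optional overwrite and unicode arguments of CreateTextFile.

```vba
' Hypothetical variant of the function above: writes the text as a
' Unicode file so that non-ASCII characters in the page source survive.
Public Sub SaveTextAsUnicode(ByVal text As String, ByVal filePath As String)
    Dim fso As Object
    Dim oFile As Object

    Set fso = CreateObject("Scripting.FileSystemObject")
    ' Arguments: file name, overwrite existing file, create as Unicode
    Set oFile = fso.CreateTextFile(filePath, True, True)
    oFile.Write text
    oFile.Close

    Set oFile = Nothing
    Set fso = Nothing
End Sub
```

For example, SaveTextAsUnicode sPageHTML, "E:\test.txt" would write to the same path used above.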