I'm having trouble scraping this

I'm trying to understand why my references aren't working well to scrape this data.
Here is the site as an example:
http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear
And as a target:
<div id="data_i6" class="rf_crow"><div id="Y_1" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="741131269">741</div><div id="Y_2" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="836611464">837</div><div id="Y_3" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="939841654">940</div><div id="Y_4" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1010472512">1,010</div><div id="Y_5" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1100344312">1,100</div><div id="Y_6" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1115401551">1,115</div></div>
What I need to extract is the actual value in rawvalue="741131269" and the following is what I've gotten to work so far.
'Cells(1, 1) = Document.getElementsByClassName("rf_crow")'returns the rows of data into one cell
'Cells(1, 1) = Document.getElementById("Y_1").innerText 'returns the text for the year
'Cells(1, 1) = Document.getElementById("data_i1").innerText 'returns the first row of data
I know the above doesn't return what I want; the comments say what each line extracts into Excel. The sub-element doesn't seem to work as it does in other macros I've built. I thought something like this would work:
Cells(1, 1) = Document.getElementById("Y_1").getAttribute("rawvalue")
but that doesn't work. I also tried:
Cells(1, 1) = Document.getElementById("data_i6").getElementById("Y_1").innertext
and that doesn't work either.

The solution is very easy. Just read the value through its attribute, which is rawvalue.
Here is how you can do it:
Using a hardcoded delay and a For Each loop to look for the desired value:
Sub GetValue()
Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object, elem As Object
With IE
.Visible = True
.Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
While .Busy = True Or .ReadyState < 4: DoEvents: Wend
Set HTML = .Document
End With
''using hardcoded delay
Application.Wait Now + TimeValue("00:00:05")
For Each elem In HTML.getElementsByTagName("div")
If elem.innerText = "741" Then MsgBox elem.getAttribute("rawvalue"): Exit For
Next elem
End Sub
Using Explicit Wait:
Sub GetValue()
Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object
With IE
.Visible = True
.Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
While .Busy = True Or .ReadyState < 4: DoEvents: Wend
Set HTML = .Document
End With
Do: Set post = HTML.querySelector("#data_i6 #Y_1"): DoEvents: Loop While post Is Nothing
MsgBox post.getAttribute("rawvalue")
End Sub
Output at this moment:
741131269
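The explicit wait above keeps looping for as long as the selector matches nothing, so a changed page layout would hang the macro. A minimal sketch of a timeout-bounded variant follows. WaitForElement and its parameters are hypothetical names, not part of the answer, and it assumes the same Microsoft HTML Object Library reference the answer's code already relies on.

```vba
' Hypothetical helper (not from the answer): poll for a CSS selector, but give up
' after timeoutSeconds. Returns Nothing when the element never appears.
Public Function WaitForElement(ByVal doc As HTMLDocument, ByVal selector As String, _
                               Optional ByVal timeoutSeconds As Double = 10) As Object
    Dim started As Double, elapsed As Double
    Dim found As Object
    started = Timer
    Do
        DoEvents                                        ' let IE keep loading
        Set found = doc.querySelector(selector)
        If Not found Is Nothing Then Exit Do
        elapsed = Timer - started
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed < timeoutSeconds
    Set WaitForElement = found
End Function
```

It could replace the Do loop in the second GetValue, for example Set post = WaitForElement(HTML, "#data_i6 #Y_1", 10), followed by an If post Is Nothing check before calling getAttribute.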

The following should illuminate some of the problems you were having.
.querySelectorAll
The exact element you mention is at index 2 of the nodeList returned by the .querySelectorAll method of .document with the CSS selector #Y_1. The # means id.
From that webpage it returns the following (sample shown - not all):
From the above you can see the string you want is returned by the index of 2 in the result.
querySelectorAll with Id? Isn't Id a unique identifier for a single element?
This Id, unexpectedly, is not unique to a single element on the page. It occurs a whopping 27 times:
This means you can use the .querySelectorAll method to return a nodeList of all matching items and take the item at index 2 to get your result.
Note:
If you want the long number next to rawvalue, 741131269, then parse the outerHTML of the returned element.
Debug.Print Replace(Split(Split(a.item(2).outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString)
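The nested Split call above is compact but hard to read. As a rough sketch, the same idea can be wrapped in a small function that looks for the attribute name followed by a quoted value. AttributeFromHtml is a hypothetical name, and the function assumes the attribute value is wrapped in double quotes in the outerHTML string, which may not hold in older IE document modes.

```vba
' Hypothetical helper: return the value of the first attrName="..." found in html,
' or an empty string when the attribute (or its closing quote) is missing.
' Assumption: the value is wrapped in double quotes.
Public Function AttributeFromHtml(ByVal html As String, ByVal attrName As String) As String
    Dim marker As String, startPos As Long, endPos As Long
    marker = attrName & "=" & Chr$(34)
    startPos = InStr(1, html, marker, vbTextCompare)
    If startPos = 0 Then Exit Function            ' attribute not present
    startPos = startPos + Len(marker)             ' first character of the value
    endPos = InStr(startPos, html, Chr$(34))      ' closing quote
    If endPos = 0 Then Exit Function
    AttributeFromHtml = Mid$(html, startPos, endPos - startPos)
End Function
```

With the element from the note above, the call would look like Debug.Print AttributeFromHtml(a.item(2).outerHTML, "rawvalue"). Where getAttribute("rawvalue") works, as in the first answer, it remains the simpler route.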
.querySelector
Alternatively, you can target the row-specific id, data_i6, with
.document.querySelector("#data_i6")
This CSS selector (#data_i6) returns the entire row as it has each year within. If using .querySelector you will only get the first item anyway which is year 1.
You can be more specific with the CSS selector and add the additional year Id to get just the year of interest:
#data_i6 #Y_1
Code: (querySelector method commented out next to querySelectorAll)
Option Explicit
Public Sub Get_Information()
Dim IE As New InternetExplorer
With IE
.Visible = True
.navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Dim a As Object, exitTime As Date
exitTime = Now + TimeSerial(0, 0, 2)
Do
DoEvents
On Error Resume Next
Set a = .document.querySelectorAll("#Y_1") ' .document.querySelector("#data_i6 #Y_1")
On Error GoTo 0
If Now > exitTime Then Exit Do
Loop While a Is Nothing
If a Is Nothing Then Exit Sub
Debug.Print Split(Split(a.item(2).innerText, "rawvalue=")(0), ">")(0) 'Split(Split(a.innerText, "rawvalue=")(0), ">")(0)
Debug.Print Replace(Split(Split(a.item(2).outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString) 'Replace(Split(Split(a.outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString)
.Quit
End With
End Sub

Declare objCollection and objElement as Object and strValue as String, and in the first line below replace YourHTTPRequest with the name of the HTTP request object you declared:
Document.body.innerHTML = YourHTTPRequest.responseText
'The Y_1 cells are child divs of the data_i6 row, so search inside that row
Set objCollection = Document.getElementById("data_i6").getElementsByTagName("div")
For Each objElement In objCollection
If objElement.ID = "Y_1" Then
strValue = objElement.getAttribute("rawvalue")
Exit For
End If
Next
Cells(1, 1) = strValue

Does this work for you?
Sub web_table_option_two()
Dim HTMLDoc As New HTMLDocument
Dim objTable As Object
Dim lRow As Long
Dim lngTable As Long
Dim lngRow As Long
Dim lngCol As Long
Dim ActRw As Long
Dim objIE As InternetExplorer
Set objIE = New InternetExplorer
objIE.Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
Do Until objIE.ReadyState = 4 And Not objIE.Busy
DoEvents
Loop
Application.Wait (Now + TimeValue("0:00:03")) 'wait for java script to load
HTMLDoc.body.innerHTML = objIE.Document.body.innerHTML
With HTMLDoc.body
Set objTable = .getElementsByTagName("table")
For lngTable = 0 To objTable.Length - 1
For lngRow = 0 To objTable(lngTable).Rows.Length - 1
For lngCol = 0 To objTable(lngTable).Rows(lngRow).Cells.Length - 1
ThisWorkbook.Sheets("Sheet1").Cells(ActRw + lngRow + 1, lngCol + 1) = objTable(lngTable).Rows(lngRow).Cells(lngCol).innerText
Next lngCol
Next lngRow
ActRw = ActRw + objTable(lngTable).Rows.Length + 1
Next lngTable
End With
objIE.Quit
End Sub

Related

Unable to make my script stop printing wrong result

I've created a script in VBA using IE to fill in a few inputs on a webpage in order to reach a new page and check the availability of some items by entering values in an input box.
To walk you through what the script currently does:
Select Buy Bricks from the landing page
Enter age 30 and country United Kingdom, then click the submit button
On the next page, enter the unique identification number for the Lego piece in the Element/design number box to populate the result.
My script can satisfy all the requirements stated above. However, when I try three different numbers, 4219725, 765467 and 230223, the one in the middle, 765467, doesn't populate any result, yet the script prints the result of the previous number.
All three numbers are used in a For Each loop within my script below.
How can I make my script print nothing when there is no result, instead of printing the wrong result?
Site address
My script so far: (could not kick out hardcoded delay)
Sub GetDetails()
Const timeOut = 10
Dim IE As New InternetExplorer, Html As HTMLDocument
Dim elem As Object, post As Object, inputNum As Variant
Dim ageInput As Object, itm As Object, T As Date
With IE
.Visible = True
.navigate "https://www.lego.com/en-gb/service/replacementparts"
While .Busy Or .readyState < 4: DoEvents: Wend
Set Html = .document
Dim event_onChange As Object
Set event_onChange = .document.createEvent("HTMLEvents")
event_onChange.initEvent "change", True, False
Html.querySelectorAll(".arrow-list-info")(2).Click
Do: Set ageInput = Html.querySelector("input[id*='How old']"): DoEvents: Loop While ageInput Is Nothing
ageInput.innerText = 30
Html.querySelector("[label='United Kingdom']").Selected = True
Html.querySelector("select").dispatchEvent event_onChange
Html.querySelector("[ng-click='startFlow()']").Click
While .Busy Or .readyState < 4: DoEvents: Wend
Set Html = .document
For Each inputNum In [{4219725,765467,230223}]
T = Timer
Do: Set post = Html.querySelector("[placeholder='Element/design number']"): DoEvents: Loop While post Is Nothing
post.ScrollIntoView
post.Focus
post.innerText = inputNum
Html.querySelector("button[ng-click='searchItemNumber()']").Click
'Can't kick out this hardcoded delay
Application.Wait Now + TimeValue("00:00:02")
Do
Set elem = Html.querySelector("div.list-item")
If Timer - T > timeOut Then Exit Do
DoEvents
Loop While elem Is Nothing
Set itm = Html.querySelector("h6.title")
If Not itm Is Nothing Then
Debug.Print itm.innerText
Else:
Debug.Print "Found Nothing"
End If
Next inputNum
Stop
End With
End Sub

This needs tidying up, but it does the job. I got rid of the hardcoded delay and added a wait for the spinner to disappear. For the no-results case, I look for an additional element that is present in the HTML when nothing is found.
Option Explicit
Public Sub GetDetails()
Const timeOut = 10
Dim ie As New InternetExplorer, html As HTMLDocument
Dim elem As Object, post As Object, inputNum As Variant
Dim ageInput As Object, itm As Object, t As Date
With ie
.Visible = True
.navigate "https://www.lego.com/en-gb/service/replacementparts"
While .Busy Or .readyState < 4: DoEvents: Wend
Set html = .document
Dim event_onChange As Object
Set event_onChange = .document.createEvent("HTMLEvents")
event_onChange.initEvent "change", True, False
html.querySelectorAll(".arrow-list-info")(2).Click
Do: Set ageInput = html.querySelector("input[id*='How old']"): DoEvents: Loop While ageInput Is Nothing
ageInput.innerText = 30
html.querySelector("[label='United Kingdom']").Selected = True
html.querySelector("select").dispatchEvent event_onChange
html.querySelector("[ng-click='startFlow()']").Click
While .Busy Or .readyState < 4: DoEvents: Wend
For Each inputNum In [{4219725,765467,230223}]
Do: Set post = .document.querySelector("[placeholder='Element/design number']"): DoEvents: Loop While post Is Nothing
post.Focus
post.innerText = inputNum
html.querySelector("button[ng-click='searchItemNumber()']").Click
Do
DoEvents 'keep Excel responsive while the spinner is visible
Loop While .document.querySelectorAll(".basic-search-btn .icon-spinner-arrows").Length > 0
t = Timer
Do
Set elem = html.querySelector("div.list-item")
If Timer - t > timeOut Then Exit Do
DoEvents
Loop While elem Is Nothing
Set elem = Nothing
Set itm = html.querySelector("h6.title")
If html.querySelectorAll(".alert.alert-info.margin-top.ng-hide").Length = 1 Then
Debug.Print "Found nothing"
Else
Debug.Print itm.innerText
End If
Set itm = Nothing
Next inputNum
ie.Quit
End With
End Sub

Scraping website using Excel vba

I'm new to VBA and I am trying to get a price (I have tried everything I know). The macro is:
Sub Deneme()
Dim objIE As InternetExplorer
Dim Prc1 As String
Set objIE = New InternetExplorer
Dim Search_Terms() As Variant
Dim CopiedData() As Variant
Dim y As Integer
objIE.Visible = False
Search_Terms = Application.Transpose(ActiveSheet.Range("A2:A169").Value)
ReDim CopiedData(LBound(Search_Terms) To UBound(Search_Terms))
y = 2
For a = LBound(Search_Terms) To UBound(Search_Terms)
objIE.navigate "https://steamcommunity.com/market/listings/578080/" & Search_Terms(a)
Do: DoEvents: Loop Until objIE.readyState = 4
Prc1 = objIE.document.getElementsByClassName("market_commodity_orders_table")(4).getElementsByTagName("tr")(1).textContent '<----- the problem is here
ActiveSheet.Range("D" & y).Value = Prc1
y = y + 1
Next
objIE.Quit
End Sub
The website is THIS and I am trying to get this value:
The error I mostly get is:
Run-time error '91':
Object variable or With block variable not set.
And the debugger highlights this line:
objIE.document.getElementsByClassName("market_commodity_orders_table")(4).getElementsByTagName("tr")(1).textContent

While testing the new code for you, I realized you have another issue besides the class name you were trying to use not existing.
The other issue is that the document finishes loading before some of the other resources do. This is likely because the site updates the price every second, so the price is not initially present in the objIE.Document object.
To get around this, I've added a couple of loops that wait for the element to become available. This should work for you.
Sub Deneme()
Dim objIE As InternetExplorer
Dim Prc1 As String
Set objIE = New InternetExplorer
Dim Search_Terms() As Variant
Dim CopiedData() As Variant
Dim y As Integer
Dim elemObj As Object
objIE.Visible = False
Search_Terms = Application.Transpose(ActiveSheet.Range("A2:A169").Value)
ReDim CopiedData(LBound(Search_Terms) To UBound(Search_Terms))
y = 2
For a = LBound(Search_Terms) To UBound(Search_Terms)
objIE.navigate "https://steamcommunity.com/market/listings/578080/" & Search_Terms(a)
Do: DoEvents: Loop Until objIE.readyState = 4
Do While Prc1 = ""
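Do While elemObj Is Nothing
DoEvents
Set elemObj = objIE.document.getElementById("market_commodity_buyrequests")
'only drill down once the container exists, otherwise error 91 is raised again
If Not elemObj Is Nothing Then Set elemObj = elemObj.getElementsByClassName("market_commodity_orders_header_promote")(1)
Loop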
Prc1 = elemObj.innerText
Loop
ActiveSheet.Range("D" & y).Value = Prc1
Set elemObj = Nothing
Prc1 = vbNullString
y = y + 1
Next
objIE.Quit
End Sub

There are 2 issues in your code:
There is no class called market_commodity_orders_table.
Item counting starts at 0, so the 4th item is item number 3.
You can use this:
Prc1 = objIE.document.getElementsByClassName("market_commodity_orders_header_promote").Item(3).innerText

Let us try it in a slightly different manner. If you have IE9 or later then the following code should work for you flawlessly. I used .querySelector() here. Give this a shot and find the price you are after.
Sub GetPrice()
Const URL As String = "https://steamcommunity.com/market/listings/578080/PLAYERUNKNOWN's%20Bandana"
Dim HTML As HTMLDocument, post As Object
With New InternetExplorer
.Visible = True
.navigate URL
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
Do: Set post = HTML.querySelector("#market_commodity_buyrequests .market_commodity_orders_header_promote:nth-of-type(2)"): DoEvents: Loop While post Is Nothing
[A1] = post.innerText
.Quit
End With
End Sub
References to add to the project:
Microsoft Internet Controls
Microsoft HTML Object Library

how to continue VBA code after opening a new web page

I'm new to writing VBA and I'm slowly getting a basic understanding of it, but I can't get past this point in my project without assistance. The code below runs fine up until I need to continue with the new page that opens. I have no idea how to continue the code; the plan is to click on the odds comparison tab and extract data from that page. Any assistance would be much appreciated.
Sub odd_comparison()
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.navigate "http://www.flashscore.com/basketball/"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
objIE.document.getElementById("fs").Children(0) _
.Children(2).Children(2).Children(0).Children(2).Click
End Sub

Try looping until the webpage is ready, as described in this and this answer (for VBA, replace WScript.Sleep with DoEvents).
Inspect the target element on the webpage with Developer Tools (using context menu or pressing F12). HTML content is as follows:
bwin.fr Odds
As you can see, there is an onclick attribute, and you can execute the JScript code from it instead of invoking the click method:
objIE.document.parentWindow.execScript "setNavigationCategory(4);pgenerate(true, 0,false,false,2); e_t.track_click('iframe-bookmark-click', 'odds');", "javascript"
Going further, you can find the following spinner element, which appears for a short time while data is being loaded after the tab is clicked:
<div id="preload" class="preload pvisit" style="display: none;"><span>Loading ...</span></div>
So you can detect when the data loading is completed by checking the visibility state:
Do Until objIE.document.getElementById("preload").style.display = "none"
DoEvents
Loop
The next step is extracting the data you need. You can get all tables from central block: .document.getElementById("fs").getElementsByTagName("table"), loop through tables and get all rows oTable.getElementsByTagName("tr"), and finally get all cells .getElementsByTagName("td") and innerText.
The below example shows how to extract all table data from the webpage odds comparison tab to Excel worksheet:
Option Explicit
Sub Test_Get_Data_www_flashscore_com()
Dim aData()
' clear sheet
Sheets(1).Cells.Delete
' retrieve content from web site, put into 2d array
aData = GetData()
' output array to sheet
Output Sheets(1).Cells(1, 1), aData
MsgBox "Completed"
End Sub
Function GetData()
Dim oIE As Object
Dim cTables As Object
Dim oTable As Object
Dim cRows As Object
Dim oRow As Object
Dim aItems()
Dim aRows()
Dim cCells As Object
Dim i As Long
Dim j As Long
Set oIE = CreateObject("InternetExplorer.Application")
With oIE
' navigate to target webpage
.Visible = True
.navigate "http://www.flashscore.com/basketball/"
' wait until webpage ready
Do While .Busy Or Not .readyState = 4: DoEvents: Loop
Do Until .document.readyState = "complete": DoEvents: Loop
Do While TypeName(.document.getElementById("fscon")) = "Null": DoEvents: Loop
' switch to odds tab
.document.parentWindow.execScript _
"setNavigationCategory(4);pgenerate(true, 0,false,false,2); e_t.track_click('iframe-bookmark-click', 'odds');", "javascript"
Do Until .document.getElementById("preload").Style.display = "none": DoEvents: Loop
' get all table nodes
Set cTables = .document.getElementById("fs").getElementsByTagName("table")
' put all rows into dictionary to compute total rows count
With CreateObject("Scripting.Dictionary")
' process all tables
For Each oTable In cTables
' get all row nodes within table
Set cRows = oTable.getElementsByTagName("tr")
' process all rows
For Each oRow In cRows
' put each row into dictionary
Set .Item(.Count) = oRow
Next
Next
' retrieve array from dictionary
aItems = .Items()
End With
' redim 1st dimension equal total rows count
ReDim aRows(1 To UBound(aItems) + 1, 1 To 1)
' process all rows
For i = 1 To UBound(aItems) + 1
Set oRow = aItems(i - 1)
' get all cell nodes within row
Set cCells = aItems(i - 1).getElementsByTagName("td")
' process all cells
For j = 1 To cCells.Length
' enlarge 2nd dimension if necessary
If UBound(aRows, 2) < j Then ReDim Preserve aRows(1 To UBound(aItems) + 1, 1 To j)
' put cell innertext into array
aRows(i, j) = Trim(cCells(j - 1).innerText)
DoEvents
Next
Next
.Quit
End With
' return populated array
GetData = aRows
End Function
Sub Output(objDstRng As Range, arrCells As Variant)
With objDstRng
.Parent.Select
With .Resize( _
UBound(arrCells, 1) - LBound(arrCells, 1) + 1, _
UBound(arrCells, 2) - LBound(arrCells, 2) + 1)
.NumberFormat = "#"
.Value = arrCells
.Columns.AutoFit
End With
End With
End Sub
Webpage odds comparison tab content for me is as follows:
It gives the output:

VBA Webscrape not picking up elements; pick up frames/tables?

I tried asking this question before but didn't get many answers, and I can't install anything on my work computer. https://stackoverflow.com/questions/29805065/vba-webscrape-not-picking-up-elements
I want to scrape a Morningstar page into Excel with the code below. The problem is that it doesn't return any real elements or data. All I actually want is the Dividend and cap gain distribution table from the link I put into my_Page.
This is usually the easiest way, but neither an entire-page scrape nor Excel --> Data --> From Web works.
I've tried getting elements by tag name and class before, but I couldn't make it work in this case. This might be the way to go... Once again, I just want that Dividend and Cap Gain distribution table. I'm not seeing any results via Debug.Print.
Working code below, just need to parse into Excel. Updated attempt below:
Sub Macro1()
Dim IE As New InternetExplorer
IE.Visible = True
IE.navigate "http://quotes.morningstar.com/fund/fundquote/f?&t=ANNPX&culture=en_us&platform=RET&viewId1=2046632524&viewId2=3141452350&viewId3=3475652630"
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim doc As HTMLDocument
Set doc = IE.document
'For Each Table In doc.getElementsByClassName("gr_table_b1")
'For Each td In Table.getElementsByTagName("tr")
On Error Resume Next
For Each td In doc.getElementsByClassName("gr_table_row4")
Debug.Print td.Cells(5).innerText
'Debug.Print td.Cells(1).innerText
Next td
'Next Table
'IE.Quit
'Application.EnableEvents = True
End Sub

The content in question is contained within an iframe. You can see this by right-clicking on that section of the website and selecting Inspect element. Looking up the tree, you'll see an iframe tag containing the URL of the data. You should try to find that element, extract that URL (which is generated with JS), and then open that page.

No frame to worry about. You only need the table id.
Webpage view:
Print out from code:
VBA:
Option Explicit
Public Sub GetDivAndCapTable()
Dim ie As New InternetExplorer, hTable As HTMLTable
Const URL = "http://quotes.morningstar.com/fund/fundquote/f?&t=ANNPX&culture=en_us&platform=RET&viewId1=2046632524&viewId2=3141452350&viewId3=3475652630"
Application.ScreenUpdating = False
With ie
.Visible = True
.navigate URL
While .Busy Or .READYSTATE < 4: DoEvents: Wend
Set hTable = .document.getElementById("DividendAndCaptical")
WriteTable hTable, 1
Application.ScreenUpdating = True
.Quit
End With
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
If ws Is Nothing Then Set ws = ActiveSheet
Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, R As Long, C As Long, tBody As Object
R = startRow
With ws
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
.Cells(startRow, columnCounter) = header.innerText
Next header
startRow = startRow + 1
Set tBody = hTable.getElementsByTagName("tbody")
For Each tSection In tBody 'HTMLTableSection
Set tRow = tSection.getElementsByTagName("tr") 'HTMLTableRow
For Each tr In tRow
Set tCell = tr.getElementsByTagName("td")
C = 1
For Each td In tCell 'DispHTMLElementCollection
.Cells(R, C).Value = td.innerText 'HTMLTableCell
C = C + 1
Next td
R = R + 1
Next tr
Next tSection
End With
End Sub

Excel VBA Macro: Scraping data from site table that spans multiple pages

Thanks in advance for the help. I'm running Windows 8.1, with the latest IE and Chrome browsers and the latest Excel. I'm trying to write an Excel macro that pulls data from Stack Overflow (https://stackoverflow.com/tags). Specifically, I'm trying to pull the date the macro is run, the tag names, the number of tags, and the brief description of each tag. I have it working for the first page of the table, but not for the rest (there are 1132 pages at the moment). Right now it overwrites the data every time I run the macro, and I'm not sure how to make it look for the next empty cell before writing. Lastly, I'm trying to make it run automatically once per week.
I'd much appreciate any help here. Problems are:
Pulling data from the web table beyond the first page
Making it scrape data to the next empty row rather than overwriting
Making the Macro run automatically once per week
Code (so far) is below. Thanks!
Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum
Sub ImportStackOverflowData()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://stackoverflow.com/tags"
'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to StackOverflow ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
'clear old data out and put titles in
'Cells.Clear
'put heading across the top of row 3
Range("A3").Value = "Date Pulled"
Range("B3").Value = "Keyword"
Range("C3").Value = "# Of Tags"
'Range("C3").Value = "Asked This Week"
Range("D3").Value = "Description"
Dim TagList As IHTMLElement
Dim Tags As IHTMLElementCollection
Dim Tag As IHTMLElement
Dim RowNumber As Long
Dim TagFields As IHTMLElementCollection
Dim TagField As IHTMLElement
Dim Keyword As String
Dim NumberOfTags As String
'Dim AskedThisWeek As String
Dim TagDescription As String
'Dim QuestionFieldLinks As IHTMLElementCollection
Dim TodaysDate As Date
Set TagList = html.getElementById("tags-browser")
Set Tags = html.getElementsByClassName("tag-cell")
RowNumber = 4
For Each Tag In Tags
'if this is the tag containing the details, process it
If Tag.className = "tag-cell" Then
'get a list of all of the parts of this question,
'and loop over them
Set TagFields = Tag.all
For Each TagField In TagFields
'if this is the keyword, store it
If TagField.className = "post-tag" Then
'store the text value
Keyword = TagField.innerText
Cells(RowNumber, 2).Value = TagField.innerText
End If
If TagField.className = "item-multiplier-count" Then
'store the integer for number of tags
NumberOfTags = TagField.innerText
'NumberOfTags = Replace(NumberOfTags, "x", "")
Cells(RowNumber, 3).Value = Trim(NumberOfTags)
End If
If TagField.className = "excerpt" Then
TagDescription = TagField.innerText
Cells(RowNumber, 4).Value = TagField.innerText
End If
TodaysDate = Format(Now, "MM/dd/yy")
Cells(RowNumber, 1).Value = TodaysDate
Next TagField
'go on to next row of worksheet
RowNumber = RowNumber + 1
End If
Next
Set html = Nothing
'do some final formatting
Range("A3").CurrentRegion.WrapText = False
Range("A3").CurrentRegion.EntireColumn.AutoFit
Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter
Range("A1:D1").Merge
Range("A1").Value = "StackOverflow Tag Trends"
Range("A1").Font.Bold = True
Application.StatusBar = ""
MsgBox "Done!"
End Sub

There's no need to scrape Stack Overflow when they make the underlying data available to you through things like the Data Explorer. Using this query in the Data Explorer should get you the results you need:
select t.TagName, t.Count, p.Body
from Tags t inner join Posts p
on t.ExcerptPostId = p.Id
order by t.count desc;
The permalink to that query is here and the "Download CSV" option which appears after the query runs is probably the easiest way to get the data into Excel. If you wanted to automate that part of things, the direct link to the CSV download of results is here

You can improve this to parse out exact elements, but it loops over all the pages and grabs all the tag info (everything next to a tag).
Option Explicit
Public Sub ImportStackOverflowData()
Dim ie As New InternetExplorer, html As HTMLDocument
Application.ScreenUpdating = False
With ie
.Visible = True
.navigate "https://stackoverflow.com/tags"
While .Busy Or .READYSTATE < 4: DoEvents: Wend
Set html = .document
Dim numPages As Long, i As Long, info As Object, item As Object, counter As Long
numPages = html.querySelector(".page-numbers.dots ~ a").innerText
For i = 1 To 2 ' numPages ''<==1 to 2 for testing; use to numPages
DoEvents
Set info = html.getElementById("tags_list")
For Each item In info.getElementsByClassName("grid-layout--cell tag-cell")
counter = counter + 1
Cells(counter, 1) = item.innerText
Next item
html.querySelector(".page-numbers.next").Click
While .Busy Or .READYSTATE < 4: DoEvents: Wend
Set html = .document
Next i
Application.ScreenUpdating = True
.Quit '<== Remember to quit application
End With
End Sub
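The question also asked how to write to the next empty row instead of overwriting earlier runs, which the code above does not cover. A minimal sketch of the usual Excel idiom follows; NextEmptyRow is a hypothetical helper name, and it assumes the chosen column has no gaps below its last entry that should be preserved.

```vba
' Hypothetical helper: first empty row in a column, so that a new run appends
' below the existing data instead of overwriting it.
Public Function NextEmptyRow(ByVal ws As Worksheet, Optional ByVal col As Long = 1) As Long
    Dim lastCell As Range
    ' Jump up from the bottom of the sheet to the last used cell in the column
    Set lastCell = ws.Cells(ws.Rows.Count, col).End(xlUp)
    If lastCell.Row = 1 And IsEmpty(lastCell.Value) Then
        NextEmptyRow = 1                  ' column is completely empty
    Else
        NextEmptyRow = lastCell.Row + 1
    End If
End Function
```

In the macro above, setting counter = NextEmptyRow(ActiveSheet, 1) - 1 before the page loop would make the first write land on the first free row, because counter is incremented before each write. For the weekly run, Application.OnTime can schedule a macro, but only while Excel stays open.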

I'm not making use of the DOM, but I find it very easy to get around by just searching between known tags. If the expressions you are looking for are too common, tweak the code a bit so that it looks for a string after another string.
An example:
Public Sub ZipLookUp()
Dim URL As String, xmlHTTP As Object, html As Object, htmlResponse As String
Dim SStr As String, EStr As String, EndS As Long, StartS As Long 'Long: positions in a large page can exceed the Integer limit
Dim Zip4Digit As String
URL = "https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=1&companyName=&address1=1642+Harmon+Street&address2=&city=Berkeley&state=CA&urbanCode=&postalCode=&zip=94703"
Set xmlHTTP = CreateObject("MSXML2.XMLHTTP")
xmlHTTP.Open "GET", URL, False
On Error GoTo NoConnect
xmlHTTP.send
On Error GoTo 0
Set html = CreateObject("htmlfile")
htmlResponse = xmlHTTP.ResponseText
If Len(htmlResponse) = 0 Then 'a String is never Null, so test for an empty response
MsgBox ("Aborted Run - HTML response was empty")
Application.ScreenUpdating = True
GoTo End_Prog
End If
'Searching for a string within 2 strings
SStr = "<span class=""address1 range"">" ' first string
EStr = "</span><br />" ' second string
StartS = InStr(1, htmlResponse, SStr, vbTextCompare) + Len(SStr)
EndS = InStr(StartS, htmlResponse, EStr, vbTextCompare)
Zip4Digit = Left(Mid(htmlResponse, StartS, EndS - StartS), 4)
MsgBox Zip4Digit
GoTo End_Prog
NoConnect:
If Err = -2147467259 Or Err = -2146697211 Then MsgBox "Error - No Connection": GoTo End_Prog 'MsgBox Err & ": " & Error(Err)
End_Prog:
End Sub
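The "string after a string" tweak mentioned at the start of this answer is not shown in the example. One way it might look is sketched below: find an anchor string first, then take the text between the next start and end markers. TextBetweenAfter and its parameter names are hypothetical, not part of the answer.

```vba
' Hypothetical helper: locate anchor first, then return the text between the next
' startTag and endTag that follow it. Returns "" when any of the three is missing.
Public Function TextBetweenAfter(ByVal source As String, ByVal anchor As String, _
                                 ByVal startTag As String, ByVal endTag As String) As String
    Dim anchorPos As Long, startPos As Long, endPos As Long
    anchorPos = InStr(1, source, anchor, vbTextCompare)
    If anchorPos = 0 Then Exit Function
    ' Only search for the start marker after the anchor
    startPos = InStr(anchorPos + Len(anchor), source, startTag, vbTextCompare)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(startTag)
    endPos = InStr(startPos, source, endTag, vbTextCompare)
    If endPos = 0 Then Exit Function
    TextBetweenAfter = Mid$(source, startPos, endPos - startPos)
End Function
```

Positions are held in Long variables because InStr returns a Long and a full page can easily run past 32,767 characters.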
