Excel VBA Macro: Scraping data from site table that spans multiple pages - vba

Thanks in advance for the help. I'm running Windows 8.1, I have the latest IE / Chrome browsers, and the latest Excel. I'm trying to write an Excel Macro that pulls data from StackOverflow (https://stackoverflow.com/tags). Specifically, I'm trying to pull the date (that the macro is run), the tag names, the # of tags, and the brief description of what the tag is. I have it working for the first page of the table, but not for the rest (there are 1132 pages at the moment). Right now, it overwrites the data everytime I run the macro, and I'm not sure how to make it look for the next empty cell before running.. Lastly, I'm trying to make it run automatically once per week.
I'd much appreciate any help here. Problems are:
Pulling data from the web table beyond the first page
Making it scrape data to the next empty row rather than overwriting
Making the Macro run automatically once per week
Code (so far) is below. Thanks!
Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum
Sub ImportStackOverflowData()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://stackoverflow.com/tags"
'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to StackOverflow ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
'clear old data out and put titles in
'Cells.Clear
'put heading across the top of row 3
Range("A3").Value = "Date Pulled"
Range("B3").Value = "Keyword"
Range("C3").Value = "# Of Tags"
'Range("C3").Value = "Asked This Week"
Range("D3").Value = "Description"
Dim TagList As IHTMLElement
Dim Tags As IHTMLElementCollection
Dim Tag As IHTMLElement
Dim RowNumber As Long
Dim TagFields As IHTMLElementCollection
Dim TagField As IHTMLElement
Dim Keyword As String
Dim NumberOfTags As String
'Dim AskedThisWeek As String
Dim TagDescription As String
'Dim QuestionFieldLinks As IHTMLElementCollection
Dim TodaysDate As Date
Set TagList = html.getElementById("tags-browser")
Set Tags = html.getElementsByClassName("tag-cell")
RowNumber = 4
For Each Tag In Tags
'if this is the tag containing the details, process it
If Tag.className = "tag-cell" Then
'get a list of all of the parts of this question,
'and loop over them
Set TagFields = Tag.all
For Each TagField In TagFields
'if this is the keyword, store it
If TagField.className = "post-tag" Then
'store the text value
Keyword = TagField.innerText
Cells(RowNumber, 2).Value = TagField.innerText
End If
If TagField.className = "item-multiplier-count" Then
'store the integer for number of tags
NumberOfTags = TagField.innerText
'NumberOfTags = Replace(NumberOfTags, "x", "")
Cells(RowNumber, 3).Value = Trim(NumberOfTags)
End If
If TagField.className = "excerpt" Then
Description = TagField.innerText
Cells(RowNumber, 4).Value = TagField.innerText
End If
TodaysDate = Format(Now, "MM/dd/yy")
Cells(RowNumber, 1).Value = TodaysDate
Next TagField
'go on to next row of worksheet
RowNumber = RowNumber + 1
End If
Next
Set html = Nothing
'do some final formatting
Range("A3").CurrentRegion.WrapText = False
Range("A3").CurrentRegion.EntireColumn.AutoFit
Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter
Range("A1:D1").Merge
Range("A1").Value = "StackOverflow Tag Trends"
Range("A1").Font.Bold = True
Application.StatusBar = ""
MsgBox "Done!"
End Sub

There's no need to scrape Stack Overflow when they make the underlying data available to you through things like the Data Explorer. Using this query in the Data Explorer should get you the results you need:
select t.TagName, t.Count, p.Body
from Tags t inner join Posts p
on t.ExcerptPostId = p.Id
order by t.count desc;
The permalink to that query is here and the "Download CSV" option which appears after the query runs is probably the easiest way to get the data into Excel. If you wanted to automate that part of things, the direct link to the CSV download of results is here

You can improve this to parse out exact elements but it loops all the pages and grabs all the tag info (everything next to a tag)
Option Explicit
Public Sub ImportStackOverflowData()
Dim ie As New InternetExplorer, html As HTMLDocument
Application.ScreenUpdating = False
With ie
.Visible = True
.navigate "https://stackoverflow.com/tags"
While .Busy Or .READYSTATE < 4: DoEvents: Wend
Set html = .document
Dim numPages As Long, i As Long, info As Object, item As Object, counter As Long
numPages = html.querySelector(".page-numbers.dots ~ a").innerText
For i = 1 To 2 ' numPages ''<==1 to 2 for testing; use to numPages
DoEvents
Set info = html.getElementById("tags_list")
For Each item In info.getElementsByClassName("grid-layout--cell tag-cell")
counter = counter + 1
Cells(counter, 1) = item.innerText
Next item
html.querySelector(".page-numbers.next").Click
While .Busy Or .READYSTATE < 4: DoEvents: Wend
Set html = .document
Next i
Application.ScreenUpdating = True
.Quit '<== Remember to quit application
End With
End Sub

I'm not making use of the DOM, but I find it very easy to get around just searching between known tags. If ever the expressions you are looking for are too common just tweak the code a bit so that it looks for a string after a string).
An example:
Public Sub ZipLookUp()
Dim URL As String, xmlHTTP As Object, html As Object, htmlResponse As String
Dim SStr As String, EStr As String, EndS As Integer, StartS As Integer
Dim Zip4Digit As String
URL = "https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=1&companyName=&address1=1642+Harmon+Street&address2=&city=Berkeley&state=CA&urbanCode=&postalCode=&zip=94703"
Set xmlHTTP = CreateObject("MSXML2.XMLHTTP")
xmlHTTP.Open "GET", URL, False
On Error GoTo NoConnect
xmlHTTP.send
On Error GoTo 0
Set html = CreateObject("htmlfile")
htmlResponse = xmlHTTP.ResponseText
If htmlResponse = Null Then
MsgBox ("Aborted Run - HTML response was null")
Application.ScreenUpdating = True
GoTo End_Prog
End If
'Searching for a string within 2 strings
SStr = "<span class=""address1 range"">" ' first string
EStr = "</span><br />" ' second string
StartS = InStr(1, htmlResponse, SStr, vbTextCompare) + Len(SStr)
EndS = InStr(StartS, htmlResponse, EStr, vbTextCompare)
Zip4Digit = Left(Mid(htmlResponse, StartS, EndS - StartS), 4)
MsgBox Zip4Digit
GoTo End_Prog
NoConnect:
If Err = -2147467259 Or Err = -2146697211 Then MsgBox "Error - No Connection": GoTo End_Prog 'MsgBox Err & ": " & Error(Err)
End_Prog:
End Sub

Related

Extracting data from website to excel error

I am having a difficulty setting up data extraction from website to Excel.
I want to extract exact price of a product to excel.
So far I have this code:
Sub GetData()
Dim objIE As InternetExplorer 'Microsoft Internet Controls library added
Dim itemEle As Object
Dim data As String
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.navigate "https://www.nay.sk/samsung-ue55nu7172"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemEle In objIE.document.getElementsByClassName("price")
data = itemEle.getElementsByClassName("price")(0).innerText
y = y + 1
Next
data = Range("A1").Value
End Sub
What would you suggest?
Do you want every price?
You can list the first two for example this way:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.nay.sk/samsung-ue55nu7172", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim titles As Object, prices As Object
With html
.body.innerHTML = sResponse
Set titles = .querySelectorAll(".title")
Set prices = .querySelectorAll(".price")
End With
For i = 0 To 1
Debug.Print titles(i).innerText & prices(i).innerText
Next i
End Sub
That loop returns you these:
You actually have all the elements on the page with a price class stored in the object prices.
You can view all the prices by looping the length of that object/nodeList with:
For i = 0 To prices.Length - 1
Debug.Print Prices.item(i).innerText
Next i
Likewise you can loop the .Length of titles but note that it is a different length from prices. There are more prices on the page (or rather elements with a price class versus elements with a title class.
References (VBE>Tools>References):
HTML Object Library
Try this:
Sub GetData()
Dim objIE As New InternetExplorer 'Microsoft Internet Controls library added
Dim itemEle As Object
Dim data As String
Dim y As Integer
objIE.Visible = True
objIE.navigate "https://www.nay.sk/samsung-ue55nu7172"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemEle In objIE.document.getElementsByClassName("price")
Cells(y, 1) = itemEle.outertext
y = y + 1
Next
End Sub
This is what you get:
To get the correct properties of the itemEle:
put a stop sign on the line from the printscreen below
select itemEle with your mouse
press Shift+F9

How to search for a keyword in entire website

I want to know if a certain keyword exists in entire website.
How can I do it?
Quick googling suggested this way
"Googling 101)
.. just type your search terms, followed by site:www.website.com
But I am not sure how to test if it returns positive or negative.
Can anyone help?
Something like this also
Function FIND_IN_PAGE(strURL As String, strSearch As String)
Dim pos As Long
Dim ie As SHDocVw.InternetExplorer
Dim doc As MSHTML.HTMLDocument
Set ie = New SHDocVw.InternetExplorer
ie.Visible = 1
ie.navigate strURL
Do Until ie.readyState = READYSTATE_COMPLETE And ie.Busy = False
DoEvents
Loop
Set doc = ie.document.DocumentElement
pos = InStr(1, doc.innerText, strSearch)
FIND_IN_PAGE = pos
ie.Quit
Set ie = Nothing
Set doc = Nothing
End Function
Calling like so
FIND_IN_PAGE("http://stackoverflow.com/questions/40848321/how-to-search-for-a-keyword-in-entire-website","entire")
Try this, it basically checks to see if there are any google search results by searching for a keyword or phrase on the site:
Sub Check_Website()
Dim ie As Object
Dim str As String, web As String, URL As String
Dim iResults As Integer
'Create IE object
Set ie = CreateObject("InternetExplorer.Application")
'Set string to search for
str = "hello"
str = Replace(str, " ", "+")
'Set website to search in
web = "www.google.com"
'Create full URL
URL = "https://www.google.co.uk/search?q=" & str & "+site%3A" & web
'Navigate to URL
With ie
.Visible = False
.Navigate URL
Do While .ReadyState <> 4: DoEvents: Loop
End With
'Count results on first page
iResults = ie.Document.getelementsbyclassname("g").Length
'Message box dependent on results
If iResults = 0 Then
MsgBox "No matches were found."
Else
MsgBox "Matches found."
End If
ie.Quit
Set ie = Nothing
End Sub
Google uses the class name of "g" for there search results meaning there will be a maximum of 10 items in the "g" class for the particular search results page, if no results are shown there is no "g" class which means there are no items to be counted.

Get website data from Urls using VBA

I have multiple urls stored in Excel sheet. I want to Get data reside within particular div tag. For One Website it works fine
Sub Cityline()
Dim IE As Object
Set IE = CreateObject("Internetexplorer.application")
IE.Visible = True
IE.navigate "http://Someurl.com/bla/bla/bla"
Do While IE.busy
DoEvents
Loop
Do
DoEvents
Dim Doc As Object
Set Doc = IE.Document
Dim workout As String
workout = Doc.getElementsByClassName("CLASS_NAME_OF_DATA")(0).innertext
Range("A2") = workout
Loop
End Sub
I used Below code for loop Through all urls but its not working
Sub GetData()
Dim oHtm As Object: Set oHtm = CreateObject("HTMLFile")
Dim req As Object: Set req = CreateObject("msxml2.xmlhttp")
Dim oRow As Object
Dim oCell As Range
Dim url As String
Dim y As Long, x As Long
x = 1
For Each oCell In Sheets("sheet1").Range("A2:A340")
req.Open "GET", oCell.Offset(, 1).Value, False
req.send
With oHtm
.body.innerhtml = req.responsetext
With .getelementsbytagname("table")(1)
With Sheets(1)
.Cells(x, 1).Value = oCell.Offset(, -1).Value
.Cells(x, 2).Value = oCell.Value
End With
y = 3
For Each oRow In .Rows
Sheets(1).Cells(x, y).Value = oRow.Cells(1).innertext
y = y + 1
Next oRow
End With
End With
x = x + 1
Next oCell
End Sub
But its not working
can any one suggest me where i went wrong ?
I used Fetching Data from multiple URLs but it doesn't works for me.
Please guide me how to get data from all urls at a Time
I'm new to SO, so apologies to the mods if this should be in comments (I couldn't get it to fit).
I agree with Silver's comments, but I thought I'd suggest a different approach that might help. If you have URLs in a column of cells, you could create a custom VBA function that will extract the relevant data out of the HTML. Just use this function in the cells to the right of your URL to return the relevant data from the HTML. An example is this:
Public Function GetHTMLData(SiteURL As String, FieldSearch As String) As String
Dim IE As Object
Dim BodyHTML As String
Dim FieldStart As Integer
Dim FieldEnd As Integer
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Navigate SiteURL
Do While .Busy Or .ReadyState <> 4
DoEvents
Loop
BodyHTML = IIf(StrComp(.Document.Title, "Cannot find server", vbTextCompare) = 0, _
vbNullString, .Document.body.innerhtml)
FieldStart = InStr(1, BodyHTML, FieldSearch) + Len(FieldSearch) + 12
FieldEnd = InStr(FieldStart, BodyHTML, "<")
GetHTMLData = Mid(BodyHTML, FieldStart, FieldEnd - FieldStart)
.Quit
End With
Set IE = Nothing
End Function
The function above has 2 input parameters: the URL and a string that will be searched for within the HTML. It will then return a string from within the HTML, starting from 12 characters after the searched parameter and ending at the following '<' within the HTML.
Hope that helps.

Scraping dynamic web page with VBA based on XMLHTTP object

I have problem with scraping table data from this page [http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures]. I use this code, but do not scrape any data:
Public Sub ScrapTableData()
Dim sURL As String
Dim XMLHttpRequest As XMLHTTP
Dim HTMLDoc As New HTMLDocument
Dim elc As HTMLHtmlElement
Dim i As Integer
sURL = "http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures"
Set XMLHttpRequest = New MSXML2.XMLHTTP
XMLHttpRequest.Open "GET", sURL, False
XMLHttpRequest.responseXML.async = False
XMLHttpRequest.send
Do While XMLHttpRequest.Status <> 200
DoEvents
Loop
While XMLHttpRequest.ReadyState <> 4
DoEvents
Wend
HTMLDoc.body.innerHTML = XMLHttpRequest.responseText
' Tables
Dim tbl As HTMLTable, row As HTMLTableRow, cell As HTMLTableCell
i = 1
For Each tbl In HTMLDoc.getElementsByTagName("table")
For Each row In tbl.Rows
For Each cell In row.Cells
ActiveSheet.Cells(i, 5) = cell.innerText
i = i + 1
Next
Next
Next
End Sub
My code does not find HTML table tags.
Also, if I use this part of code, do not list all HTML tags (for example HTML DIV tag) and HTML that describes 6 buttons:
i = 0
Dim elc As HTMLHtmlElement
For Each elc In HTMLDoc.all
Worksheets("Tables").Range("A1").Offset(i, 0) = elc.tagName
i = i + 1
Next
6 buttons: Year, Quarter, Month,..., Day
I need to simulate click on them to display (scrape) different tables' data.
I don't think the XMLHTTP approach will work in this case, you need to open IE. The following code will do this. You may need to modify the loop to put data in your worksheet, I didn't tinker with this. At the end, I've also placed some code that will change the tabs. Hope this helps
Sub test()
' open IE, navigate to the website of interest and loop until fully loaded
Set IE = CreateObject("InternetExplorer.Application")
my_url = "http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures"
With IE
.Visible = True
.navigate my_url
.Top = 50
.Left = 530
.Height = 400
.Width = 400
Do Until Not IE.Busy And IE.readyState = 4
DoEvents
Loop
End With
' Collect data from tables
Set tbl = IE.document.getElementsByTagName("table")
For Each itm In tbl
i = 1
For Each itm2 In itm.Rows
For Each cell In itm2.Cells
ActiveSheet.Cells(i, 5) = cell.innertext
i = i + 1
Next
Next
Next
' Click on the 6 buttons, substitute "week", "year", etc. for the button you want to click
Set Results = IE.document.getElementsByTagName("a")
For Each itm In Results
If InStr(1, itm.innertext, "month", vbTextCompare) > 0 Then
itm.Click
Do Until Not IE.Busy And IE.readyState = 4
DoEvents
Loop
Exit For
End If
Next
' Do whatever is next
End Sub

VBA hanging on ie.busy and readystate check

I am trying to grab some football player data from a website to fill a privately used database. I've included the entire code below. This first section is a looper that calls the second function to fill a database. I've run this code in MSAccess to fill a database last summer and it worked great.
Now I am only getting a few teams to fill before the program gets hung up at
While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
I've searched countless websites regarding this error and tried changing this code by putting in sub function to wait a period of seconds or other work-arounds. None of those solve the issue. I've also tried running this on multiple computers.
The first computer made it through 3 teams (or three calls of the 2nd function). The second slower computer makes it through 5 teams. Both eventually hang. The 1st computer has Internet Explorer 10 and the second has IE8.
Sub Parse_NFL_RawSalaries()
Status ("Importing NFL Salary Information.")
Dim mydb As Database
Dim teamdata As DAO.Recordset
Dim i As Integer
Dim j As Double
Set mydb = CurrentDb()
Set teamdata = mydb.OpenRecordset("TEAM")
i = 1
With teamdata
Do Until .EOF
Call Parse_Team_RawSalaries(teamdata![RotoworldTeam])
.MoveNext
i = i + 1
j = i / 32
Status("Importing NFL Salary Information. " & Str(Round(j * 100, 0)) & "% done")
Loop
End With
teamdata.Close ' reset variables
Set teamdata = Nothing
Set mydb = Nothing
Status ("") 'resets the status bar
End Sub
Second function:
Function Parse_Team_RawSalaries(Team As String)
Dim mydb As Database
Dim rst As DAO.Recordset
Dim IE As InternetExplorer
Dim HTMLdoc As HTMLDocument
Dim TABLEelements As IHTMLElementCollection
Dim TRelements As IHTMLElementCollection
Dim TDelements As IHTMLElementCollection
Dim TABLEelement As Object
Dim TRelement As Object
Dim TDelement As HTMLTableCell
Dim c As Long
' open the table
Set mydb = CurrentDb()
Set rst = mydb.OpenRecordset("TempSalary")
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
IE.navigate "http://www.rotoworld.com/teams/contracts/nfl/" & Team
While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Set HTMLdoc = IE.Document
Set TABLEelements = HTMLdoc.getElementsByTagName("Table")
For Each TABLEelement In TABLEelements
If TABLEelement.id = "cp1_tblContracts" Then
Set TRelements = TABLEelement.getElementsByTagName("TR")
For Each TRelement In TRelements
If TRelement.className <> "columnnames" Then
rst.AddNew
rst![Team] = Team
c = 0
Set TDelements = TRelement.getElementsByTagName("TD")
For Each TDelement In TDelements
Select Case c
Case 0
rst![Player] = Trim(TDelement.innerText)
Case 1
rst![position] = Trim(TDelement.innerText)
Case 2
rst![ContractTerms] = Trim(TDelement.innerText)
End Select
c = c + 1
Next TDelement
rst.Update
End If
Next TRelement
End If
Next TABLEelement
' reset variables
rst.Close
Set rst = Nothing
Set mydb = Nothing
IE.Quit
End Function
In Parse_Team_RawSalaries, instead of using the InternetExplorer.Application object, how about using MSXML2.XMLHTTP60?
So, instead of this:
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
IE.navigate "http://www.rotoworld.com/teams/contracts/nfl/" & Team
While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Set HTMLdoc = IE.Document
Maybe try using this (add a reference to "Microsoft XML 6.0" in VBA Editor first):
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60
IE.Open "GET", "http://www.rotoworld.com/teams/contracts/nfl/" & Team, False
IE.send
While IE.ReadyState <> 4
DoEvents
Wend
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.htmlBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body
HTMLBody.innerHTML = IE.responseText
I've generally found that MSXML2.XMLHTTP60 (and WinHttp.WinHttpRequest, for that matter) generally perform better (faster and more reliable) than InternetExplorer.Application.
I've found this post very helpful when I encountered similiar problem. Here is my solution:
I used
Dim browser As SHDocVw.InternetExplorer
Set browser = New SHDocVw.InternetExplorer
and
cTime = Now + TimeValue("00:01:00")
Do Until (browser.readyState = 4 And Not browser.Busy)
If Now < cTime Then
DoEvents
Else
browser.Quit
Set browser = Nothing
MsgBox "Error"
Exit Sub
End If
Loop
Sometimes page is loaded but code stops on DoEvents and goes on and on and on. Using this code it goes on only for 1 minute and if browser is not ready it quits the browser and exits sub.
I know this is a old post but. I have had the same problem with my code for downloading web site pictures using Excel VBA automation. Some sites wont let you download a image file using a link without first opening the link in a browser. However my code was getting hung up sometimes with when the objBrowser.visible was set to false with the folowing code
Do Until (objBrowser.busy = False And objBrowser.readyState = 4)
Application.Wait (Now + TimeValue("0:00:01"))
DoEvents 'browser.readyState = 4
Loop
the simple fix was to make the objBrowser.visible
I fixed it with
Dim Passes As Integer: Passes = 0
Do Until (objBrowser.busy = False And objBrowser.readyState = 4)
Passes = Passes + 1 'count loops
Application.Wait (Now + TimeValue("0:00:01"))
DoEvents
If Passes > 5 Then
'set size browser cannot set it smaller than 400
objBrowser.Width = 400 'set size
objBrowser.Height = 400
Label8.Caption = Passes 'display loop count
' position browser "you cannot move it off the screen" ready state wont change
objBrowser.Left = UserForm2.Left + UserForm2.Width
objBrowser.Top = UserForm2.Top + UserForm2.Height
objBrowser.Visible = True
DoEvents
objBrowser.Visible = False
End If
Loop
objBrowser only flashes for less than a second but it gets the job done!