How to scrape data from the following table format VBA - vba

I am trying to scrape the table contents of every page, from the first to the last, from this webpage.
Using the code below I can scrape the table contents of page 1, but I don't know how to modify the code to get the data from the first page through to the last.
Option Explicit
Sub NBAStats()
Dim IE As Object, obj As Object
Dim r As Integer, c As Integer, t As Integer
Dim elemCollection As Object
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Visible = True
.navigate ("http://stats.nba.com/league/player/#!/")
While IE.ReadyState <> 4
DoEvents
Wend
Do While IE.busy: DoEvents: Loop
ThisWorkbook.Worksheets(1).Cells.Clear
Set elemCollection = IE.Document.getElementsByTagName("TABLE")
For t = 0 To (elemCollection.Length - 1)
For r = 0 To (elemCollection(t).Rows.Length - 1)
For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
End With
Set IE = Nothing
End Sub

Try to find the sitemap.xml of the website you are scraping. The sitemap.xml will list all the links on the site.
Import that XML file into your Excel sheet, read each link, and fetch each table from it.
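As a rough sketch of that idea, the macro below downloads a sitemap and writes every <loc> URL it contains to column A. The sitemap address is a placeholder, and it is only an assumption that the site publishes one at all and that the server returns it with an XML content type (otherwise responseXML comes back empty).

```vba
Option Explicit

' Sketch: list every <loc> URL from a sitemap in column A of the first sheet.
Public Sub ListSitemapLinks()
    Const SITEMAP_URL As String = "https://example.com/sitemap.xml" ' placeholder
    Dim http As Object, xmlDoc As Object, locNodes As Object
    Dim i As Long

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", SITEMAP_URL, False      ' False = synchronous request
    http.send

    Set xmlDoc = http.responseXML            ' parsed only if the server sent XML
    If xmlDoc.documentElement Is Nothing Then
        MsgBox "No XML returned from " & SITEMAP_URL
        Exit Sub
    End If

    ' getElementsByTagName matches on the tag name, so the sitemap's
    ' default namespace does not get in the way.
    Set locNodes = xmlDoc.getElementsByTagName("loc")
    For i = 0 To locNodes.Length - 1
        ThisWorkbook.Worksheets(1).Cells(i + 1, 1).Value = locNodes.Item(i).Text
    Next i
End Sub
```

Each URL in column A could then be passed to whatever routine reads the table on a single page. This only helps when every page of data has its own URL; it will not help if paging happens inside a single page without the URL changing.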

First, in my opinion, VBA automation of Internet Explorer is highly unstable and not really practicable for production use. This also means that scraping data from web sites which are provided only for viewing within a browser is not really practicable for production use. If you are entitled to use the data, then you should ask for another data source (XML or JSON, for example). If you are not entitled, then you should not do it; the provider of the web site may well not agree to it.
To be clear, I'm talking about web sites like this one, which provides its data with JavaScript only. If the data were within the HTML, then you could get it via XMLHTTP. That is another matter.
I will nevertheless provide a "solution", so that you can't simply think "He is simply unable to do this, so he is saying you should not do it."
So you must analyze the site and pick out the elements you can click for navigation.
Option Explicit
Sub NBAStats()
Dim IE As Object
Dim r As Long, c As Long, t As Long, rSheet As Long, rStart As Long
Dim bReady As Boolean
Dim elementsTable As Object
Dim elementsPageNavRigth As Object
Dim elemPageNavRigth As Object
Dim elementsTableDiv As Object
ThisWorkbook.Worksheets(1).Cells.Clear
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Visible = True
.navigate ("http://stats.nba.com/league/player/#!/")
Do While IE.busy
DoEvents
Loop
rSheet = 0
Do
Do While elementsTableDiv Is Nothing
Set elementsTableDiv = IE.Document.getElementsByClassName("table-responsive")
DoEvents
Loop
Do While elementsTableDiv(0) Is Nothing
DoEvents
Loop
Set elementsPageNavRigth = IE.Document.getElementsByClassName("page-nav right")
Set elemPageNavRigth = elementsPageNavRigth(0)
If elemPageNavRigth.className = "page-nav right disabled" Then bReady = True
'If rSheet = 0 Then rStart = 0 Else rStart = 1
Set elementsTable = elementsTableDiv(0).getElementsByTagName("TABLE")
For r = rStart To (elementsTable(0).Rows.Length - 1)
For c = 0 To (elementsTable(0).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + rSheet + 1, c + 1) = elementsTable(0).Rows(r).Cells(c).innerText
Next c
Next r
rSheet = rSheet + r
If Not elemPageNavRigth Is Nothing Then elemPageNavRigth.Click
Set elementsTableDiv = Nothing
Loop Until bReady Or elemPageNavRigth Is Nothing
End With
Set IE = Nothing
End Sub

Related

Need help to convert Internet Explorer based web scraping to XMLHTTP

I am trying to speed up some intranet web scraping as well as make it more reliable. I am just learning how to implement XMLHTTP and I need some advice on converting my code from IE based scraping to XMLHTTP.
I have 2 subs in my module that load and navigate the intranet site (GetWebTable) and parse the data (GetOneTable) to return a table in Excel. The subs are as follows:
Sub GetWebTable(sAccountNum As String)
On Error Resume Next
Dim objIE As Object
Dim strBuffer As String
Dim thisCol As Integer
Dim iAcctCount As Integer
Dim iCounter As Integer
Dim iNextCounter As Integer
Dim iAcctCell As Integer
Dim thisColCustInfo As Integer
Dim iErrorCounter As Integer
If InStr(1, sAccountNum, "-") <> 0 Then
sAccountNum = Replace(sAccountNum, "-", "")
End If
If InStr(1, sAccountNum, " ") <> 0 Then
sAccountNum = Replace(sAccountNum, " ", "")
End If
iErrorCounter = 1
TRY_AGAIN:
'Spawn Internet Explorer
Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-XXXXXXX}")
DoEvents
With objIE
.Visible = False
.Navigate "http://intranetsite.aspx"
While .busy = True Or .readystate <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
.Document.getElementById("ctl00_MainContentRegion_tAcct").Value = sAccountNum
While .busy = True Or .readyState <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
.Document.getElementById("ctl00_MainContentRegion_btnRunReport").Click
While .busy = True Or .readyState <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
End With
thisCol = 53
thisColCustInfo = 53
GetOneTable objIE.Document, 9, thisCol
'Cleanup:
objIE.Quit
Set objIE = Nothing
GetWebTable_Error:
Select Case Err.Number
Case 0
Case Else
Debug.Print Err.Number, Err.Description
iErrorCounter = iErrorCounter + 1
objIE.Quit
Set objIE = Nothing
If iErrorCounter > 4 Then On Error Resume Next
GoTo TRY_AGAIN
'Stop
End Select
End Sub
Sub GetOneTable(varWebPageDoc, varTableNum, varColInsert)
Dim varDocElement As Object ' the elements of the document
Dim varDocTable As Object ' the table required
Dim varDocRow As Object ' the rows of the table
Dim varDocCell As Object ' the cells of the rows.
Dim Rng As Range
Dim iCellCount As Long
Dim iElemCount As Long
Dim iTableCount As Long
Dim iRowCount As Long
Dim iRowCounter As Integer
Dim bTableEndFlag As Boolean
bTableEndFlag = False
For Each varDocElement In varWebPageDoc.all
If varDocElement.nodeName = "TABLE" Then
iElemCount = iElemCount + 1
End If
If iElemCount = varTableNum Then
Set varDocTable = varDocElement
iTableCount = iTableCount + 1
iRowCount = iRowCount + 1
Set Rng = Worksheets("Sheet1").Cells(2, varColInsert)
For Each varDocRow In varDocTable.Rows
For Each varDocCell In varDocRow.Cells
If Left(varDocCell.innerText, 9) = "Total for" Then
bTableEndFlag = True
Exit For
End If
Rng.Value = varDocCell.innerText
Set Rng = Rng.Offset(, 1)
iCellCount = iCellCount + 1
Next varDocCell
iRowCount = iRowCount + 1
Set Rng = Rng.Offset(1, -iCellCount)
iCellCount = 0
Next varDocRow
Exit For
End If
Next varDocElement
Set varDocElement = Nothing
Set varDocTable = Nothing
Set varDocRow = Nothing
Set varDocCell = Nothing
Set Rng = Nothing
End Sub
Any thoughts?
HTML is not XML. XML is strictly enforced in terms of opening and closing tags, whilst HTML is famous for <br> tags without a closing </br>. You'd be very lucky if the HTML is XML compliant.
Anyway, if you want to use XMLHTTP because of the HTTP request and still keep your IE based web scraping code then see this article http://exceldevelopmentplatform.blogspot.com/2018/01/vba-xmlhttp-request-xhr-does-not-parse.html It shows how to use XMLHTTP before passing the response to MSHTML.
You can use MSHTML independently of IE, see this article Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library). If you read that you will see much of the code that you write against the IE object model is in fact against the MSHTML object model and as such you can decouple and jettison IE. Enjoy!
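To illustrate that decoupling, here is a minimal sketch that downloads a page with XMLHTTP and parses it with a late-bound MSHTML document ("htmlfile"), so no IE window is involved. The URL is a placeholder, the function and sub names are made up for this example, and it assumes the table is present in the raw HTML rather than built by JavaScript.

```vba
Option Explicit

' Parse an HTML string with MSHTML (no browser) and return the text of every
' cell in the first <table>, row by row, as a Collection of strings.
Public Function TableCellTexts(ByVal htmlText As String) As Collection
    Dim doc As Object, tables As Object, tbl As Object
    Dim r As Long, c As Long
    Dim result As New Collection

    Set doc = CreateObject("htmlfile")       ' late-bound MSHTML document
    doc.body.innerHTML = htmlText
    Set tables = doc.getElementsByTagName("table")
    If tables.Length > 0 Then
        Set tbl = tables(0)
        For r = 0 To tbl.Rows.Length - 1
            For c = 0 To tbl.Rows(r).Cells.Length - 1
                result.Add tbl.Rows(r).Cells(c).innerText
            Next c
        Next r
    End If
    Set TableCellTexts = result
End Function

' Fetch a page over HTTP and print its first table to the Immediate window.
Public Sub DemoFetchAndParse()
    Dim cellText As Variant
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://example.com/report.aspx", False   ' hypothetical URL
        .send
        For Each cellText In TableCellTexts(.responseText)
            Debug.Print cellText
        Next cellText
    End With
End Sub
```

The row and cell loops are the same ones you would write against IE.Document; only the source of the document has changed.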
EDIT1: Don't forget you can ask your company's IT staff
You say it is an intranet site which implies internal to your company, you could ask the programmers who are responsible for that system for a direct API guide.
EDIT2: Folding in feedback about how to mimic a browser...
To mimic the browser you need to figure out the traffic that button clicks generate...
To watch network traffic I recommend you switch to Chrome as your browser. Then, on this web page, right-click mouse button and take "Inspect" menu option, this opens the Chrome Developer Tools. Then, in Developer Tools select the Network tab, then click on a link on this page and you will see the traffic that is generated.
So, if you want to go pure XMLHTTP and leave browsers behind then you won't have buttons available to click but you can observe the network traffic that happens when a button is clicked in a browser and you can then mimic this in code.
So for example, in your comment you ask how do I enter an account number and click the button. I'm guessing that clicking a button will result in a XMLHTTP call of something like http://example.com/dowork/mypage.asp?accountnumber=1233456&otherParams=true so you see account number would be buried in the query parameters. Once you have that url you can put that in your XMLHTTP request.
One potential problem is that system designers may have chosen to hide account numbers in the body of an HTTP POST because it is sensitive/confidential data. However, Chrome Developer Tools is very good and should still yield that information, but you may have to poke around.
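If the form turns out to be submitted by POST, the request could be replayed along these lines. This is only a sketch: the URL and the accountnumber field are the made-up ones from the example above, and the real names have to be copied from what the Network tab shows. ASP.NET Web Forms pages (which the ctl00_ element IDs in the question suggest) typically also send hidden fields such as __VIEWSTATE, which would have to be included in the body as well.

```vba
Option Explicit

' Send a form-encoded POST and return the response body as text.
Public Function PostForm(ByVal url As String, ByVal formBody As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "POST", url, False                 ' False = synchronous
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send formBody
        If .Status <> 200 Then
            Err.Raise vbObjectError + 1, "PostForm", "HTTP status " & .Status
        End If
        PostForm = .responseText
    End With
End Function

Public Sub DemoPost()
    Dim html As String
    ' Hypothetical endpoint and field names; values containing special
    ' characters would need URL-encoding before being sent.
    html = PostForm("http://example.com/dowork/mypage.asp", _
                    "accountnumber=1233456&otherParams=true")
    Debug.Print Left$(html, 500)
End Sub
```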

Unable to fetch some content using createDocumentFromUrl in vba

I've written some code in VBA to get the movie names from a torrent website using the .createDocumentFromUrl() method. As I've never worked with this method and haven't found any useful information on how to use it successfully, I got stuck. I have tried to create a script, though.
Here is my try:
Sub Fromurl_Method()
Const URL As String = "https://yts.am/browse-movies"
Dim hStart As HTMLDocument, hdoc As HTMLDocument
Set hStart = New HTMLDocument
Set hdoc = hStart.createDocumentFromUrl(URL, vbNullString)
Do While hdoc.readyState = "loading" Or hdoc.readyState = "interactive": DoEvents: Loop
Debug.Print hdoc.DocumentElement.innerHTML
End Sub
When I execute the above script, it fetches some HTML elements which are not from that website. I took a closer look at the elements I had parsed and noticed this line at the top: "This content cannot be displayed in a frame." It behaves the same way with most sites. How can I make it work? Thanks in advance.
Once again, my intention is to parse all the movie names from that site using .createDocumentFromUrl().
Sub Get_Info()
Dim Elems, e As Variant
Const READYSTATE_COMPLETE& = 4&
Dim ie As Object
Set ie = Nothing
DoEvents
Set ie = CreateObject("InternetExplorer.Application")
DoEvents
With ie
.Visible = false
.Navigate "https://yts.am/browse-movies"
While Not .readyState = READYSTATE_COMPLETE
DoEvents
Wend
End With
Dim i As Double
With ie.Document
Set Elems = .getElementsByTagName("a")
DoEvents
i = 2
For Each e In Elems
If e.getAttribute("class") = "browse-movie-title" Then
Range("A" & i).Value = e.innerText
i = i + 1
End If
Next e
End With
Set Elems = Nothing
Set e = Nothing
ie.Quit
Set ie = Nothing
End Sub
The code above will give you a list of all movies. Just modify the code to adapt it to your needs of getting the first one if you only need the first one.

Extracting website data with Excel and VBA [duplicate]

I'm trying to scrape data, such as real-time prices (e.g. German 5 YR Bobl, US 30Y T-Bond), from the website http://uk.investing.com/rates-bonds/financial-futures via VBA. I have tried an Excel web query, but it only scrapes the whole website, and I would like to scrape the rate only. Is there a way of doing this?
There are several ways of doing this. I am writing this answer hoping that all the basics of Internet Explorer automation will be found when browsing for the keywords "scraping data from website", but remember that nothing is worth as much as your own research (if you don't want to stick to pre-written code that you're not able to customize).
Please note that this is one way, which I don't prefer in terms of performance (since it depends on the browser speed), but it is good for understanding the rationale behind Internet automation.
1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
2) I ask the browser to browse the target webpage. Through the property ".Visible", I decide whether I want to see the browser doing its job or not. When building the code it is nice to have Visible = True, but once the code is working it is nice not to see the browser every time, so Visible = False.
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures"
.Visible = True
End With
3) The webpage will need some time to load. So, I will wait while it's busy...
Do While appIE.Busy
DoEvents
Loop
4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond:
What I will do is press F12 in Internet Explorer to see the webpage's code, and then, using the element-selection pointer, click on the element that I want to scrape to see how I can reach it.
5) What I should do is straight-forward. First of all, I will get by the ID property the tr element which is containing the value:
Set allRowOfData = appIE.document.getElementById("pair_8907")
Here I will get a collection of td elements (specifically, tr is a row of data, and the td elements are its cells). We are looking for the 8th, so I will write:
Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
Why did I write 7 instead of 8? Because the collection of cells starts from 0, so the index of the 8th element is 7 (8-1). Briefly analysing this line of code:
.Cells() makes me access the td elements;
innerHTML is the property of the cell containing the value we look for.
Once we have our value, which is now stored in the myValue variable, we can just close the IE browser and release the memory by setting it to Nothing:
appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label of a form (Me.Label1.Caption = myValue).
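Putting the five steps together, the whole routine might look like the sketch below. It assumes the row ID pair_8907 and the cell index 7 are still valid for the page, which may no longer be the case, and it adds a readyState check and a Nothing check that the individual steps leave out. The sub name is made up for this example.

```vba
Option Explicit

' Sketch: open the page in IE, read one cell of one row, write it to A1.
Public Sub GetBondChange()
    Dim appIE As Object, allRowOfData As Object
    Dim myValue As String

    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Visible = True                      ' set to False once it works
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
        Do While .Busy Or .readyState <> 4   ' 4 = READYSTATE_COMPLETE
            DoEvents
        Loop
    End With

    Set allRowOfData = appIE.document.getElementById("pair_8907")
    If allRowOfData Is Nothing Then
        MsgBox "Row pair_8907 not found; the page layout may have changed."
    Else
        myValue = allRowOfData.Cells(7).innerHTML   ' 8th cell, zero-based index
        Range("A1").Value = myValue
    End If

    appIE.Quit
    Set appIE = Nothing
End Sub
```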
I'd just like to point out to you that this is not how StackOverflow works: here you post questions about specific coding problems, but you should do your own search first. The reason why I'm answering a question which does not show much research effort is just that I see it asked several times and, back when I learned how to do this, I remember that I would have liked to have some better support to get started with. So I hope that this answer, which is just a "study input" and not at all the best/most complete solution, can be a support for the next user having the same problem. Because I have learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.
Enjoy your practice ;)
Other methods were mentioned so let us please acknowledge that, at the time of writing, we are in the 21st century. Let's park the local bus browser opening, and fly with an XMLHTTP GET request (XHR GET for short).
Wiki moment:
XHR is an API in the form of an object whose methods transfer data
between a web browser and a web server. The object is provided by the
browser's JavaScript environment
It's a fast method for retrieving data that doesn't require opening a browser. The server response can be read into an HTMLDocument and the process of grabbing the table continued from there.
Note that javascript rendered/dynamically added content will not be retrieved as there is no javascript engine running (which there is in a browser).
In the below code, the table is grabbed by its id cr1.
In the helper sub, WriteTable, we loop the column headers (th tags) and then the table rows (tr tags), and finally traverse the length of each table row, table cell by table cell. As we only want data from columns 1 and 8, a Select Case statement is used to specify what is written out to the sheet.
VBA:
Option Explicit
Public Sub GetRates()
Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Application.ScreenUpdating = False
Set hTable = html.getElementById("cr1")
WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
Application.ScreenUpdating = True
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
With ws
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
Select Case columnCounter
Case 2
.Cells(startRow, 1) = header.innerText
Case 8
.Cells(startRow, 2) = header.innerText
End Select
Next header
startRow = startRow + 1
Set tBody = hTable.getElementsByTagName("tbody")
For Each tSection In tBody
Set tRow = tSection.getElementsByTagName("tr")
For Each tr In tRow
r = r + 1
Set tCell = tr.getElementsByTagName("td")
C = 1
For Each td In tCell
Select Case C
Case 2
.Cells(r, 1).Value = td.innerText
Case 8
.Cells(r, 2).Value = td.innerText
End Select
C = C + 1
Next td
Next tr
Next tSection
End With
End Sub
You can use the WinHttpRequest object instead of Internet Explorer. It loads only the page's data, without the pictures and advertisements that make the Internet Explorer object heavy compared to the WinHttpRequest object.
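A minimal sketch of that suggestion, using the late-bound WinHttp.WinHttpRequest.5.1 object. The function and sub names are made up for this example, and whether the site serves the same HTML to a non-browser client is an assumption you would need to check.

```vba
Option Explicit

' Download a page's HTML with WinHttpRequest; no browser window is opened
' and no images, scripts or adverts are fetched.
Public Function GetPageHtml(ByVal url As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
    req.Open "GET", url, False               ' False = wait for the response
    req.SetRequestHeader "User-Agent", "Mozilla/5.0"
    req.Send
    ' Return an empty string on anything other than HTTP 200
    If req.Status = 200 Then GetPageHtml = req.ResponseText
End Function

Public Sub DemoGetPageHtml()
    Dim html As String
    html = GetPageHtml("https://uk.investing.com/rates-bonds/financial-futures")
    Debug.Print Len(html) & " characters received"
End Sub
```

The returned string can then be loaded into an HTMLDocument in the same way the XMLHTTP answer does with html.body.innerHTML.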
This question was asked long ago, but I thought the following information would be useful for newbies. You can easily get a value by class name like this.
Sub ExtractLastValue()
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
objIE.Navigate ("https://uk.investing.com/rates-bonds/financial-futures/")
Do
DoEvents
Loop Until objIE.readystate = 4
MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText
End Sub
And if you are new to web scraping please read this blog post.
Web Scraping - Basics
There are also various techniques to extract data from web pages. This article explains a few of them with examples.
Web Scraping - Collecting Data From a Webpage
I modified some things that were popping up errors for me and ended up with this, which worked great to extract the data as I needed:
Sub get_data_web()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")
Dim i As Long
Dim myValue As String
Count = 1
For Each itm In allRowofData
For i = 0 To 4
myValue = itm.Cells(i).innerText
ActiveSheet.Cells(Count, i + 1).Value = myValue
Next
Count = Count + 1
Next
appIE.Quit
Set appIE = Nothing
End Sub

Collect images url and data from webpage table to Excel table

I need to collect data from a table on a webpage; some table cells contain images.
The code should copy the data to Excel and, if a cell has an image, get its src link instead of the image. The code is below, but it is not working: I don't know how to detect whether a cell has an image in it or not, and how to add its src link to the Excel cell.
Sub extractData()
Dim IE As Object, obj As Object
Dim myYear As String
Dim r As Integer, c As Integer, t As Integer
Dim elemCollection As Object
Set IE = CreateObject("InternetExplorer.Application")
myYear = InputBox("Enter year")
With IE
.Visible = True
.navigate ("url")
While IE.ReadyState <> 4
DoEvents
Wend
For Each obj In IE.Document.All.Item("Year").Options
If obj.innerText = myYear Then
obj.Selected = True
End If
Next obj
IE.Document.getElementsByName("btn_search").Item.Click
Do While IE.busy: DoEvents: Loop
ThisWorkbook.Sheets("Sheet1").Range("A1:K2000").ClearContents
Set elemCollection = IE.Document.getElementsByTagName("TABLE")
For t = 0 To (elemCollection.Length - 9)
For r = 0 To (elemCollection(t).Rows.Length - 1)
For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
If elemCollection(t).Rows(r).Cells(c).innerText = "" Then
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).getAttribute("src")
Exit For
End If
Next
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
End With
Set IE = Nothing
End Sub
First, brush up on HTML Document Object Model. There are tons of tutorials on how to use JavaScript to work with the DOM, and VBA is real similar (because the DOM doesn't change based on language and VBA is very closely related to VBScript which is similar to JavaScript). Second, if you get an error but no line is highlighted when you click the Debug button, step through your code line by line with the F8 key. This will let you figure out where the error is occurring. Third, add a reference to the Microsoft HTML Object Library so you can use Intellisense for code hints.
It's tough to give an actual solution without seeing the HTML source so instead I'll give some pointers:
Use IE.Document.Body.getElementsByTagName("TABLE") (note the addition of BODY to narrow the scope) to get a collection of every table on the page. With a reference to the HTML Objects Lib you can do something like this:
Dim oTable As HTMLTable
Dim oCell As HTMLTableCell
Dim oImg As HTMLImg
Dim strSrc As String
' oTable must be set first; index 0 here is only a placeholder for the table you need
Set oTable = IE.Document.Body.getElementsByTagName("TABLE")(0)
For Each oCell In oTable.Cells
strSrc = ""
Set oImg = Nothing
On Error Resume Next
' Take the first img in the cell, if any; its src property holds the image URL
Set oImg = oCell.getElementsByTagName("img")(0)
strSrc = oImg.src
On Error GoTo 0
If strSrc <> "" Then Debug.Print strSrc
Next
This should (I did not test it) loop through every cell in a table and attempt to get an img element. If it fails, no biggie, just continue to the next cell. If you want to use late binding after you get it working, remove the HTML Obj Lib reference then simply dim everything as an object. Eg:
Dim oTable As Object 'HTMLTable
Dim oCell As Object 'HTMLTableCell
Dim oImg As Object 'HTMLImg
Dim strSrc As String

Vba- retrieve value from multiple internet explorer websites to multiple cells

Issue:
I would like to retrieve a particular value (Prev Close) from multiple Internet Explorer web pages and copy them to multiple cells (Column C) automatically. I know how to retrieve a value from a single web page into a single cell, but I have no idea how to retrieve values from multiple web pages and copy them to multiple cells.
My computer info:
1. Windows 8.1
2. Excel 2013
3. IE 11
My excel reference
Microsoft Object Library: yes
Microsoft Internet Controls: yes
Microsoft Form 2.0 Object library: yes
Microsoft Script Control 1.0: yes
URL:
http://finance.yahoo.com/q?s=hpq&type=2button&fr=uh3_finance_web_gs_ctrl1&uhb=uhb2
Below is my VBA code:
Private Sub CommandButton1_Click()
Dim ie As Object
Dim Doc As HTMLDocument
Dim prevClose As String
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
ie.navigate "http://finance.yahoo.com/q;_ylt=AsqtxVZ0vjCPfBnINCrCWlXJgfME?uhb=uhb2&fr=uh3_finance_vert_gs_ctrl1_e&type=2button&s=" & Range("b2").Value
Do
DoEvents
Loop Until ie.readyState = 4
Set Doc = ie.document
prevClose = Trim(Doc.getElementById("table1").getElementsByTagName("td")(0).innerText)
Range("c2").Value = prevClose
End Sub
Don't use multiple tabs unless you really need to. It's an unscalable solution that breaks quickly as the tabs add up.
It's far simpler and easier to just use one tab and deal with one webpage at a time using simple looping constructs. For this I am assuming that your URLs are the one you provided plus some string contained in column B.
Private Sub CommandButton1_Click()
Const YAHOO_PARTIAL_URL As String = "http://finance.yahoo.com/q;_ylt=AsqtxVZ0vjCPfBnINCrCWlXJgfME?uhb=uhb2&fr=uh3_finance_vert_gs_ctrl1_e&type=2button&s="
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
For r = 2 To 10 ' Or whatever your row count is.
ie.navigate YAHOO_PARTIAL_URL & Cells(r, "B").Value
Do
DoEvents
Loop Until ie.readyState = 4
Dim Doc As HTMLDocument
Set Doc = ie.document
Dim prevClose As String
prevClose = Trim(Doc.getElementById("table1").getElementsByTagName("td")(0).innerText)
Cells(r, "C").Value = prevClose
Next r
End Sub