I'm trying to scrape data from the website http://uk.investing.com/rates-bonds/financial-futures via VBA, specifically real-time prices such as the German 5 YR Bobl and the US 30Y T-Bond. I have tried an Excel web query, but it pulls in the whole page, and I would like to scrape only the rate. Is there a way of doing this?
There are several ways of doing this. I am writing this answer in the hope that people searching for "scraping data from website" will find all the basics of Internet Explorer automation here, but remember that nothing is worth as much as your own research (if you don't want to be stuck with pre-written code that you can't customize).
Please note that this is only one way, and not the one I prefer in terms of performance (since it depends on the browser's speed), but it is a good way to understand the rationale behind Internet automation.
1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
2) I ask the browser to navigate to the target webpage. Through the ".Visible" property, I decide whether I want to see the browser doing its job or not. While building the code it is nice to have Visible = True, but once the code is working it is nicer not to see the browser every time, so Visible = False.
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures"
.Visible = True
End With
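3) The webpage will need some time to load, so I wait while it's busy. Checking readyState as well as Busy is a little more robust, because Busy can briefly report False before the document is ready (4 means the page is complete):
Do While appIE.Busy Or appIE.readyState <> 4
DoEvents
Loop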
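The wait loop in step 3 can hang forever if the page never finishes loading. As a minimal sketch (the helper name WaitForPage and the 30-second default are my own assumptions, not part of the original answer), the loop could be wrapped in a function with a timeout:

```vba
' Hypothetical helper: waits until IE reports the page as loaded.
' Returns True when the page is ready, False if the timeout expires first.
Public Function WaitForPage(ByVal ie As Object, _
                            Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    ' 4 = READYSTATE_COMPLETE
    Do While ie.Busy Or ie.readyState <> 4
        If Now > deadline Then Exit Function   ' WaitForPage stays False
        DoEvents
    Loop

    WaitForPage = True
End Function
```

It would be called as `If Not WaitForPage(appIE) Then ...` right after .Navigate, so the caller can decide what to do when the page does not load in time.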
4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond:
I simply press F12 in Internet Explorer to see the webpage's code and, using the pointer (circled in red), click on the element that I want to scrape to see how I can reach it.
5) What I should do is straightforward. First of all, I use the ID property to get the tr element that contains the value:
Set allRowOfData = appIE.document.getElementById("pair_8907")
Here I will get a collection of td elements (specifically, tr is a row of data, and the td elements are its cells). We are looking for the 8th cell, so I will write:
Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
Why did I write 7 instead of 8? Because the collection of cells starts from 0, so the index of the 8th element is 7 (8-1). Briefly analysing this line of code:
.Cells() makes me access the td elements;
innerHTML is the property of the cell containing the value we look for.
Once we have our value, which is now stored in the myValue variable, we can close the IE browser and release the memory by setting the object to Nothing:
appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label on a form (Me.Label1.Caption = myValue).
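Putting steps 1 to 5 together, the whole walkthrough can be sketched as one procedure. The id "pair_8907" and the cell index 7 come from the steps above; the procedure name, the A1 target, and the check for a missing row are my own additions, so treat this as a starting point rather than a finished solution:

```vba
' Sketch combining steps 1-5 of the walkthrough.
Public Sub ScrapeOneCell()
    Dim appIE As Object
    Dim rowOfData As Object
    Dim myValue As String

    Set appIE = CreateObject("InternetExplorer.Application")
    appIE.Visible = False
    appIE.Navigate "http://uk.investing.com/rates-bonds/financial-futures"

    ' Wait for the page to finish loading (4 = READYSTATE_COMPLETE)
    Do While appIE.Busy Or appIE.readyState <> 4
        DoEvents
    Loop

    ' getElementById returns Nothing when no element has that id
    Set rowOfData = appIE.document.getElementById("pair_8907")
    If rowOfData Is Nothing Then
        MsgBox "Row not found; the page layout may have changed."
    Else
        myValue = rowOfData.Cells(7).innerHTML   ' 8th cell, zero-based index
        ActiveSheet.Range("A1").Value = myValue
    End If

    ' Always close the browser, even when nothing was found
    appIE.Quit
    Set appIE = Nothing
End Sub
```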
I'd just like to point out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should do your own research first. The reason I'm answering a question that doesn't show much research effort is that I have seen it asked several times and, back when I was learning how to do this, I remember that I would have liked better support to get started. So I hope that this answer, which is just a "study input" and not at all the best or most complete solution, can help the next user with the same problem. I learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.
Enjoy your practice ;)
Other methods were mentioned so let us please acknowledge that, at the time of writing, we are in the 21st century. Let's park the local bus browser opening, and fly with an XMLHTTP GET request (XHR GET for short).
Wiki moment:
XHR is an API in the form of an object whose methods transfer data
between a web browser and a web server. The object is provided by the
browser's JavaScript environment
It's a fast method for retrieving data that doesn't require opening a browser. The server response can be read into an HTMLDocument and the process of grabbing the table continued from there.
Note that JavaScript-rendered or dynamically added content will not be retrieved, as there is no JavaScript engine running (which there is in a browser).
In the code below, the table is grabbed by its id, cr1.
In the helper sub, WriteTable, we first loop the header cells (th tags), then the table rows (tr tags), and finally walk each row cell by cell (td tags). As we only want two of the columns (positions 2 and 8 in the code's 1-based count), a Select Case statement is used to specify what is written out to the sheet.
Sample webpage view:
Sample code output:
VBA:
Option Explicit
Public Sub GetRates()
Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Application.ScreenUpdating = False
Set hTable = html.getElementById("cr1")
WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
Application.ScreenUpdating = True
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
With ws
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
Select Case columnCounter
Case 2
.Cells(startRow, 1) = header.innerText
Case 8
.Cells(startRow, 2) = header.innerText
End Select
Next header
startRow = startRow + 1
Set tBody = hTable.getElementsByTagName("tbody")
For Each tSection In tBody
Set tRow = tSection.getElementsByTagName("tr")
For Each tr In tRow
r = r + 1
Set tCell = tr.getElementsByTagName("td")
C = 1
For Each td In tCell
Select Case C
Case 2
.Cells(r, 1).Value = td.innerText
Case 8
.Cells(r, 2).Value = td.innerText
End Select
C = C + 1
Next td
Next tr
Next tSection
End With
End Sub
You can use the WinHttpRequest object instead of Internet Explorer. It fetches only the page's HTML, without downloading pictures and advertisements, which makes it much lighter than the Internet Explorer object.
This question was asked a long time ago, but I thought the following information would be useful for newbies. You can easily get the values by class name, like this:
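As a minimal sketch of the WinHttpRequest suggestion (the function name FetchHtml and the User-Agent string are my own assumptions; some sites may need different headers or may block non-browser clients), a synchronous GET could look like this:

```vba
' Hypothetical helper: returns the raw HTML of a page, or "" on a non-200 status.
Public Function FetchHtml(ByVal url As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")

    req.Open "GET", url, False                        ' False = synchronous
    req.SetRequestHeader "User-Agent", "Mozilla/5.0"  ' assumed; adjust if needed
    req.Send

    If req.Status = 200 Then FetchHtml = req.ResponseText
End Function
```

The returned string can then be loaded into an HTML document object and queried in the same way as in the XMLHTTP answer above.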
Sub ExtractLastValue()
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
objIE.Navigate ("https://uk.investing.com/rates-bonds/financial-futures/")
Do
DoEvents
Loop Until objIE.readystate = 4
MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText
End Sub
And if you are new to web scraping, please read this blog post:
Web Scraping - Basics
There are also various techniques for extracting data from web pages. This article explains a few of them with examples:
Web Scraping - Collecting Data From a Webpage
I modified a few things that were raising errors for me and ended up with this, which worked great to extract the data I needed:
Sub get_data_web()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")
Dim i As Long
Dim myValue As String
Count = 1
For Each itm In allRowofData
For i = 0 To 4
myValue = itm.Cells(i).innerText
ActiveSheet.Cells(Count, i + 1).Value = myValue
Next
Count = Count + 1
Next
appIE.Quit
Set appIE = Nothing
End Sub
Related
I've written a script in VBA using IE to parse some links from a webpage. The thing is, the links are within an iframe. I've tweaked my code so that the script first finds a link within that iframe, navigates to that new page, and parses the required content from there. If I do it this way, I can get all the links.
Webpage URL: weblink
Successful approach (working one):
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim elem As Object, post As Object
With IE
.Visible = True
.navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set elem = .document.getElementById("compInfo") 'it is within iframe
.navigate elem.src
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
For Each post In HTML.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've seen a few sites where no such link exists within the iframe, so there I have no link to follow to track down the content.
If you take a look at the approach linked below, you will notice that I've parsed content that sits within an iframe. There is no link within that iframe to navigate to a new webpage, so I used contentWindow.document instead and found it working flawlessly.
Link to the working code of parsing Iframe content from another site:
contentWindow approach
However, my question is: why should I navigate to a new webpage to collect the links when I can see the content in the landing page? I tried using contentWindow.document, but it gives me an access denied error. How can I make my code below work using contentWindow.document like I did above?
I tried like this but it throws access denied error:
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim frm As Object, post As Object
With IE
.Visible = True
.Navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
''the code breaks when it hits the following line "access denied error"
Set frm = HTML.getElementById("compInfo").contentWindow.document
For Each post In frm.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've attached an image to let you know which links (they are marked with pencil) I'm after.
These are the elements within which one such link (i would like to grab) is found:
<div class="news">
<span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
<a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019039003&opt=9">ABB India Limited - Press Release</a>
</div>
Image of the links of that page I would like to grab:
From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?
You can see the links within the <iframe> in the browser, but you can't access them programmatically due to the Same-origin policy.
Here is an example showing how to retrieve the links using XHR and RegEx:
Option Explicit
Sub Test()
Dim sContent As String
Dim sUrl As String
Dim aLinks() As String
Dim i As Long
' Retrieve initial webpage HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Extract target iframe URL via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Match the iframe whose src contains "Companydetails"
.Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
sUrl = .Execute(sContent).Item(0).SubMatches(0)
End With
' Retrieve iframe HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", sUrl, False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Parse links via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Process all anchors within div.news
.Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
With .Execute(sContent)
ReDim aLinks(0 To .Count - 1)
For i = 0 To .Count - 1
aLinks(i) = .Item(i).SubMatches(0)
Next
End With
End With
Debug.Print Join(aLinks, vbCrLf)
End Sub
Generally, RegEx isn't recommended for HTML parsing, so take this with a disclaimer. The data being processed in this case is quite simple, which is why it is parsed with RegEx.
The output for me is as follows:
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9
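As an alternative to the second RegEx step, the iframe's HTML can be loaded into a late-bound "htmlfile" document and walked with DOM methods. This is only a sketch: the helper name NewsLinks is hypothetical, and it assumes the markup shown in the question (anchors inside div elements whose class is exactly "news"). The second argument to getAttribute asks for the attribute as written in the source; without it, relative links may come back with an "about:" prefix.

```vba
' Hypothetical helper: returns the href of the first anchor in every <div class="news">.
Public Function NewsLinks(ByVal sContent As String) As Collection
    Dim doc As Object, div As Object, anchors As Object
    Dim result As New Collection

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = sContent

    For Each div In doc.getElementsByTagName("div")
        If div.className = "news" Then
            Set anchors = div.getElementsByTagName("a")
            ' Flag 2: return the attribute value as it appears in the source
            If anchors.Length > 0 Then result.Add anchors(0).getAttribute("href", 2)
        End If
    Next div

    Set NewsLinks = result
End Function
```

With sContent holding the iframe response from the code above, `For Each link In NewsLinks(sContent): Debug.Print link: Next` would list the links.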
I also tried to copy the content of the <iframe> from IE to the clipboard (for pasting into the worksheet afterwards) using these commands:
IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
But those commands actually select and copy the main document, excluding the frame, unless I click on the frame manually. So this could work only if a click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).
Something like this should work. The key is to realize the iFrame is technically another Document. Reviewing the iFrame on the page you listed, you can easily use a web request to get at the data you need. As already mentioned, the reason you get an error is the Same-Origin policy. You could write something to get the src of the iFrame and then do the web request as I've shown below, or use IE to scrape the page, get the src, and then load that page, which looks like what you have done.
I would recommend the web request approach; Internet Explorer can get annoying, fast.
Code
Public Sub SOExample()
Dim html As Object 'To store the HTML content
Dim Elements As Object 'To store the anchor collection
Dim Element As Object 'To iterate the anchor collection
Set html = CreateObject("htmlFile")
With CreateObject("MSXML2.XMLHTTP")
'Navigate to the source of the iFrame, it's another page
'View the source for the iframe. Alternatively -
'you could navigate to this page and use IE to scrape it
'False makes the request synchronous, so .Status is valid right after .send
.Open "GET", "https://stocks.thehindubusinessline.com/Companydetails.aspx?&cocode=INE117A01022", False
.send ""
'See if the request was ok, exit it there was an error
If Not .Status = 200 Then Exit Sub
'Assign the page's HTML to an HTML object
html.body.InnerHTML = .responseText
Set Elements = html.body.document.getElementByID("hmstockchart_CompanyNews1_updateGLVV")
Set Elements = Elements.getElementsByTagName("a")
For Each Element In Elements
'Print out the data to the Immediate window
Debug.Print Element.InnerText
Next
End With
End Sub
Results
ABB India Limited - AGM/Book Closure
Board of ABB India recommends final dividend
ABB India to convene AGM
ABB India to pay dividend
ABB India Limited - Outcome of Board Meeting
More ?
The simplest solution, as everyone suggested, is to go directly to the link. This takes the IFRAME out of the picture and makes it easier for you to loop through the links. But if you still don't like that approach, you need to go a bit deeper down the hole.
Below is a function from a library I wrote long ago in VB.NET:
https://github.com/tarunlalwani/ScreenCaptureAPI/blob/2646c627b4bb70e36fe2c6603acde4cee3354b39/Source%20Code/ScreenCaptureAPI/ScreenCaptureAPI/ScreenCapture.vb#L803
Private Function _EnumIEFramesDocument(ByVal wb As HTMLDocumentClass) As Collection
Dim pContainer As olelib.IOleContainer = Nothing
Dim pEnumerator As olelib.IEnumUnknown = Nothing
Dim pUnk As olelib.IUnknown = Nothing
Dim pBrowser As SHDocVW.IWebBrowser2 = Nothing
Dim pFramesDoc As Collection = New Collection
_EnumIEFramesDocument = Nothing
pContainer = wb
Dim i As Integer = 0
' Get an enumerator for the frames
If pContainer.EnumObjects(olelib.OLECONTF.OLECONTF_EMBEDDINGS, pEnumerator) = 0 Then
pContainer = Nothing
' Enumerate and refresh all the frames
Do While pEnumerator.Next(1, pUnk) = 0
On Error Resume Next
' Clear errors
Err.Clear()
' Get the IWebBrowser2 interface
pBrowser = pUnk
If Err.Number = 0 Then
pFramesDoc.Add(pBrowser.Document)
i = i + 1
End If
Loop
pEnumerator = Nothing
End If
_EnumIEFramesDocument = pFramesDoc
End Function
So basically this is a VB.NET version of the C++ version below:
Accessing body (at least some data) in a iframe with IE plugin Browser Helper Object (BHO)
Now you just need to port it to VBA. The only problem you may have is finding the olelib reference. Most of the rest is VBA compatible.
So once you get the collection of documents, find the one that belongs to your frame and then just use that one:
frames = _EnumIEFramesDocument(IE)
frames.Item(1).document.getElementsByTagName("A").length
I thought I figured this out over the weekend, but it actually doesn't work the way I thought it would. I have a confidential corporate SharePoint site that I work with. I can't post the link here, or any specific data, but the concept below will illustrate the point fine.
I have a parent URL that I want to import data from. Let's say this is the parent URL.
http://www.sharenet.co.za/v3/q_sharelookup.php
From there, I want to import data from a specific link. Let's say this is the link: 'Building & Construction Materials'
I think the best way to do this is some kind of InStr() function and search for the string. Then, if found, click the link and open the child URL. When the child URL opens, it looks something like this:
http://www.sharenet.co.za/v3/sharesfound.php?ssector=2353&exch=JSE&bookmark=Building%20&%20Construction%20Materials&scheme=default
I can't tell what the sector numbers will be ahead of time, so I can't use a specific URL. I need to reference it as the parent and child, or maybe IE1 and IE2. I want to import all data from the child URL, which in this example, looks like this.
Name | Full Name | Code | Sector
BUILDMX | BUILDMAX LIMITED | BDM | 2353
KAYDAV | KAYDAV GROUP LTD | KDV | 2353
AFRIMAT | AFRIMAT LTD | AFT | 2353
Trellidor | Trellidor Hldgs Ltd | TRL | 2353
MASONITE | MASONITE (AFRICA) LIMITED | MAS | 2353
DAWN | DISTRIBUTION AND WAREHOUSING NETWORK LIMITED | DAW | 2353
MAZOR | MAZOR GROUP LTD | MZR | 2353
PPC | PPC LIMITED | PPC | 2353
PPCN | PPC Limited NPL | PPCN | 2353
Just to demonstrate how I tried to solve this, I tried the script below.
Sub ListLinks()
'Set a reference to microsoft Internet Controls
Dim IeApp As InternetExplorer
Dim sURL As String
Dim IeDoc As Object
Dim i As Long
Set IeApp = New InternetExplorer
IeApp.Visible = True
sURL = "http://www.sharenet.co.za/v3/q_sharelookup.php"
IeApp.Navigate sURL
Do
Loop Until IeApp.ReadyState = READYSTATE_COMPLETE
Set IeDoc = IeApp.Document
For i = 0 To IeDoc.Links.Length - 1
Cells(i + 1, 1).Value = IeDoc.Links(i).href
Next i
Set IeApp = Nothing
End Sub
I thought it would work fine to list all URLs and then loop through each to import data, but the problem on my SharePoint site is that the href doesn't appear to have any relation to the name of the hyperlink.
In the picture above you can see 'Building & Construction Materials' in the TD element. If I can reference that in the 1st browser, and click the correct link to open a 2nd browser, and then reference that 2nd browser and scrape all TD elements from that, everything should work fine. Does anyone here know how to do that?
Good try on the code, you got pretty close. The one area that needs fixing is where you get the list of items and loop over it. You had the right idea about how it would work, but the HTML element syntax is a little off, so it looks like you just need some more experience using HTML objects. See the sample code below:
Public Sub sampleCode()
Dim URL As String
Dim XMLHTTP As MSXML2.XMLHTTP60
Dim HTMLDoc_Main As HTMLDocument
Dim HTMLDoc_Secondary As HTMLDocument
Dim targetTable As HTMLObjectElement
Dim links As IHTMLElementCollection
Dim linkCounter As Long
Dim searchText As String
URL = "http://www.sharenet.co.za/v3/q_sharelookup.php"
searchText = "Building & Construction Materials"
Set XMLHTTP = New MSXML2.XMLHTTP60
Set HTMLDoc_Main = New HTMLDocument
With XMLHTTP
.Open "GET", URL, False
.send
While .readyState <> 4: Wend
HTMLDoc_Main.body.innerHTML = .responseText
End With
Set targetTable = HTMLDoc_Main.getElementsByClassName("dataTable")(0)
Set links = targetTable.getElementsByTagName("a")
For linkCounter = 0 To links.Length - 1
With links(linkCounter)
If InStr(1, .innerText, searchText) > 0 Then
Set XMLHTTP = New MSXML2.XMLHTTP60
Set HTMLDoc_Secondary = New HTMLDocument
XMLHTTP.Open "GET", .href, False
XMLHTTP.send
While XMLHTTP.readyState <> 4: Wend
HTMLDoc_Secondary.body.innerHTML = XMLHTTP.responseText
'Parse HTMLDoc_Secondary
End If
End With
Next
Set XMLHTTP = Nothing
Set HTMLDoc_Main = Nothing
Set HTMLDoc_Secondary = Nothing
End Sub
A couple of notes: 1) I used XMLHTTP instead of IE as it is faster; 2) you will need to add references to 'Microsoft HTML Object Library' and 'Microsoft XML, v6.0'; and 3) I can see you are writing to ranges cell by cell in your original code. If at all possible this should be avoided: populate an array and then write its entire contents to your target sheet all at once to save time.
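The array approach from note 3 can be sketched as follows. The row count, the placeholder values, and the target range are purely illustrative; in real code the array would be filled from the parsed HTML:

```vba
' Sketch: fill a 2-D array in memory, then write it to the sheet in one assignment.
Public Sub DumpArrayToSheet()
    Const ROW_COUNT As Long = 100        ' illustrative size
    Dim results() As Variant
    Dim i As Long

    ReDim results(1 To ROW_COUNT, 1 To 2)
    For i = 1 To ROW_COUNT
        results(i, 1) = "link text " & i             ' placeholder data
        results(i, 2) = "http://example.com/" & i    ' placeholder data
    Next i

    ' One write to the worksheet instead of ROW_COUNT * 2 separate writes
    ThisWorkbook.Worksheets(1).Range("A1").Resize(ROW_COUNT, 2).Value = results
End Sub
```

Each individual write to a cell crosses from VBA into Excel, so a single Range assignment is usually much faster than writing inside the loop.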
Hope this helps,
TheSilkCode
Issue:
I would like to retrieve a particular value (Prev Close) from multiple web pages in Internet Explorer and copy the values to multiple cells (column C) automatically. I know how to retrieve a value from a single page into a single cell, but I have no idea how to retrieve values from multiple pages and copy them to multiple cells.
My computer info:
1. Windows 8.1
2. Excel 2013
3. IE 11
My excel reference
Microsoft Object Library: yes
Microsoft Internet Controls: yes
Microsoft Form 2.0 Object library: yes
Microsoft Script Control 1.0: yes
URL:
http://finance.yahoo.com/q?s=hpq&type=2button&fr=uh3_finance_web_gs_ctrl1&uhb=uhb2
Below is my VBA code:
Private Sub CommandButton1_Click()
Dim ie As Object
Dim Doc As HTMLDocument
Dim prevClose As String
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
ie.navigate "http://finance.yahoo.com/q;_ylt=AsqtxVZ0vjCPfBnINCrCWlXJgfME?uhb=uhb2&fr=uh3_finance_vert_gs_ctrl1_e&type=2button&s=" & Range("b2").Value
Do
DoEvents
Loop Until ie.readyState = 4
Set Doc = ie.document
prevClose = Trim(Doc.getElementById("table1").getElementsByTagName("td")(0).innerText)
Range("c2").Value = prevClose
End Sub
Don't use multiple tabs unless you really need to. It's an unscalable solution that breaks quickly as the tabs add up.
It's far simpler to use one tab and deal with one webpage at a time using a simple loop. For this I am assuming that your URLs are the one you provided plus some string contained in column B.
Private Sub CommandButton1_Click()
Const YAHOO_PARTIAL_URL As String = "http://finance.yahoo.com/q;_ylt=AsqtxVZ0vjCPfBnINCrCWlXJgfME?uhb=uhb2&fr=uh3_finance_vert_gs_ctrl1_e&type=2button&s="
Dim ie As Object, r As Long
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
For r = 2 To 10 ' Or whatever your row count is.
ie.navigate YAHOO_PARTIAL_URL & Cells(r, "B").Value
Do
DoEvents
Loop Until ie.readyState = 4
Dim Doc As HTMLDocument
Set Doc = ie.document
Dim prevClose As String
prevClose = Trim(Doc.getElementById("table1").getElementsByTagName("td")(0).innerText)
Cells(r, "C").Value = prevClose
Next r
ie.Quit ' close the hidden browser instance when done
End Sub
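Instead of hard-coding the upper bound of the loop ("For r = 2 To 10"), the last used row in column B can be looked up at run time. A minimal sketch, where the function name LastTickerRow is my own:

```vba
' Hypothetical helper: returns the row number of the last non-empty cell in column B.
Public Function LastTickerRow(ByVal ws As Worksheet) As Long
    ' Start at the bottom of the sheet and jump up to the last filled cell
    LastTickerRow = ws.Cells(ws.Rows.Count, "B").End(xlUp).Row
End Function
```

The loop would then read `For r = 2 To LastTickerRow(ActiveSheet)`. Note that on a sheet with an empty column B this returns 1, so the loop body simply does not run.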
I need to scrape Title, product description and Product code and save it into worksheet from <<<HERE>>> in this case those are :
"Catherine Lansfield Helena Multi Bedspread - Double"
"This stunning ivory bedspread has been specially designed to sit with the Helena bedroom range. It features a subtle floral design with a diamond shaped quilted finish. The bedspread is padded so can be used as a lightweight quilt in the summer or as an extra layer in the winter.
Polyester.
Size L260, W240cm.
Suitable for a double bed.
Machine washable at 30°C.
Suitable for tumble drying.
EAN: 5055184924746.
Product Code 116/4196"
I have tried different methods and none worked for me in the end. The Mid and InStr functions returned nothing, though it could be that my code was wrong. Sorry that I'm not posting any code; I have already reworked it many times with no result. I have tried to scrape the whole page with GetDatafromPage. It works well, but for different product pages the output lands in different rows, as the number of elements changes from page to page. It's also not possible to scrape only chosen elements, so it is pointless to read values from fixed cells.
Another option, instead of using the InternetExplorer object, is the xmlhttp object. Here is an example similar to kekusemau's, but using the xmlhttp object to request the page. I then load the responseText from the xmlhttp object into an htmlfile document.
Sub test()
Dim xml As Object
Set xml = CreateObject("MSXML2.XMLHTTP")
xml.Open "GET", "http://www.argos.co.uk/static/Product/partNumber/1164196.htm", False
xml.send
Dim doc As Object
Set doc = CreateObject("htmlfile")
doc.body.innerhtml = xml.responsetext
Dim name
Set name = doc.getElementById("pdpProduct").getElementsByTagName("h1")(0)
MsgBox name.innerText
Dim desc
Set desc = doc.getElementById("genericESpot_pdp_proddesc2colleft").getElementsByTagName("div")(0)
MsgBox desc.innerText
Dim id
Set id = doc.getElementById("pdpProduct").getElementsByTagName("span")(0).getElementsByTagName("span")(2)
MsgBox id.innerText
End Sub
This seems to be not too difficult. You can use Firefox to take a look at the page structure (right-click somewhere and click inspect element, and go on from there...)
Here is a simple sample code:
Sub test()
Dim ie As InternetExplorer
Dim x
Set ie = New InternetExplorer
ie.Visible = True
ie.Navigate "http://www.argos.co.uk/static/Product/partNumber/1164196.htm"
While ie.ReadyState <> READYSTATE_COMPLETE
DoEvents
Wend
Set x = ie.Document.getElementById("pdpProduct").getElementsByTagName("h1")(0)
MsgBox Trim(x.innerText)
Set x = ie.Document.getElementById("genericESpot_pdp_proddesc2colleft").getElementsByTagName("div")(0)
MsgBox x.innerText
Set x = ie.Document.getElementById("pdpProduct").getElementsByTagName("span")(0).getElementsByTagName("span")(2)
MsgBox x.innerText
ie.Quit
End Sub
(I have a reference in Excel to Microsoft Internet Controls, I don't know if that is there by default, if not you have to set it first to run this code).
I am trying to find a way to get the data from yelp.com
I have a spreadsheet on which there are several keywords and locations. I am looking to extract data from yelp listings based on these keywords and locations already in my spreadsheet.
I have created the following code, but it seems to get absurd data and not the exact information I am looking for.
I want to get business name, address and phone number, but all I am getting is nothing. If anyone here could help me solve this problem.
Sub find()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
ie.Visible = False
ie.Navigate "http://www.yelp.com/search?find_desc=boutique&find_loc=New+York%2C+NY&ns=1&ls=3387133dfc25cc99#start=10"
' Don't show window
ie.Visible = False
'Wait until IE is done loading page
Do While ie.Busy
Application.StatusBar = "Downloading information, please wait..."
DoEvents
Loop
' Make a string from IE content
Set mDoc = ie.Document
peopleData = mDoc.body.innerText
ActiveSheet.Cells(1, 1).Value = peopleData
End With
peopleData = "" 'Nothing
Set mDoc = Nothing
End Sub
If you right-click in IE and choose View Source, it is apparent that the data served on the site is not part of the document's .Body.innerText property. I notice this is often the case with dynamically served data, and that approach is really too simple for most web scraping.
I open the page in Google Chrome and inspect the elements to get an idea of what I'm really looking for and how to find it using a DOM/HTML parser; you will need to add a reference to Microsoft HTML Object Library.
I think you can get it to return a collection of the <DIV> tags, and then check those for the class name with an If statement inside the loop.
I made some revisions to my original answer; this should print each record in a new cell:
Option Explicit
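#If VBA7 Then
Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If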
Sub find()
'Uses late binding, or add reference to Microsoft HTML Object Library
' and change variable Types to use intellisense
Dim ie As Object 'InternetExplorer.Application
Dim html As Object 'HTMLDocument
Dim Listings As Object 'IHTMLElementCollection
Dim l As Object 'IHTMLElement
Dim r As Long
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = False
.Navigate "http://www.yelp.com/search?find_desc=boutique&find_loc=New+York%2C+NY&ns=1&ls=3387133dfc25cc99#start=10"
' Don't show window
'Wait until IE is done loading page
Do While .readyState <> 4
Application.StatusBar = "Downloading information, Please wait..."
DoEvents
Sleep 200
Loop
Set html = .Document
End With
Set Listings = html.getElementsByTagName("LI") ' ## returns the list
For Each l In Listings
'## make sure this list item looks like the listings Div Class:
' then, build the string to put in your cell
If InStr(1, l.innerHTML, "media-block clearfix media-block-large main-attributes") > 0 Then
Range("A1").Offset(r, 0).Value = l.innerText
r = r + 1
End If
Next
Set html = Nothing
ie.Quit ' close the hidden IE instance so it does not linger
Set ie = Nothing
End Sub
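The idea mentioned earlier in this answer, collecting tags and checking each one's class name inside the loop, can be sketched as a small late-bound helper. The function name ElementsWithClass is hypothetical, and the class name you pass in depends on the markup you see when inspecting the page:

```vba
' Hypothetical helper: returns all elements of a given tag that carry a given class.
Public Function ElementsWithClass(ByVal doc As Object, _
                                  ByVal tagName As String, _
                                  ByVal className As String) As Collection
    Dim result As New Collection
    Dim el As Object

    For Each el In doc.getElementsByTagName(tagName)
        ' Pad with spaces so "news" does not match "news-date_time"
        If InStr(1, " " & el.className & " ", " " & className & " ", vbBinaryCompare) > 0 Then
            result.Add el
        End If
    Next el

    Set ElementsWithClass = result
End Function
```

For example, `ElementsWithClass(html, "DIV", "main-attributes")` would return the matching DIV elements, whose innerText can then be written to cells as in the loop above.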