I am wondering whether it is possible to use VBA to scrape data from public Facebook pages, such as the number of followers or the number of likes. I put together the code below.
Right now I expect to face two issues: (i) Internet Explorer is no longer supported by Facebook and (ii) I am not sure whether Facebook allows scraping.
So I guess what I am actually looking for is code that reads out the web page's source code. To keep things simple: I would be happy to use Internet Explorer (i.e., skip problem (i)) and just read out the number of "others like this".
See the attached screenshot for the class name in the DOM explorer and the number of likes I am looking for.
Any ideas?
Code:
Sub social_facebook()
Dim IE As New InternetExplorer
Dim html As HTMLDocument
Dim url As String
url = "https://www.facebook.com/adidasoriginals"
With ActiveSheet
Dim results(0 To 4) ', counter As Long, i As Long
With IE
.Visible = True
.navigate url
While .Busy Or .readyState < 4: DoEvents: Wend
'--------------------------------------------------------------------------
Set html = IE.document
Set HTMLDivElement = html.getElementsByClassName("_59k _2rgt _1j-f _2rgt")
'_59k _2rgt _1j-f _2rgt >> this is - according to my understanding - the class name I am looking for.
Debug.Print HTMLDivElement.innerHTML
.Quit
End With
End With
'-------------------------------------------------------------------------
End Sub
HTMLDivElement is a collection, so you need to get its first item:
Set html = IE.document
Set HTMLDivElement = html.getElementsByClassName("_59k _2rgt _1j-f _2rgt")
'_59k _2rgt _1j-f _2rgt >> this is - according to my understanding - the class name I am looking for.
Debug.Print HTMLDivElement(0).innerHTML
The getElementsByClassName method of the Document interface returns an array-like object of all child elements which have all of the given class name(s). The element you want is the fifth element with the classes _59k _2rgt _1j-f _2rgt, and the collection is zero-based, so its index is 4:
Set html = IE.document
Set HTMLDivElement = html.getElementsByClassName("_59k _2rgt _1j-f _2rgt")(4)
Debug.Print HTMLDivElement.innerHTML
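Because the index depends on how many matching elements the page happens to contain, it may be worth checking the collection length before indexing. The following is only a minimal sketch: NthByClass is a hypothetical helper name, not part of MSHTML, and it assumes the document is rendered in a mode where getElementsByClassName is available.

```vba
' Hypothetical helper: returns the innerHTML of the element at a zero-based
' index among all elements carrying the given class names, or an empty
' string when the index is out of range.
Public Function NthByClass(ByVal doc As Object, ByVal classNames As String, ByVal index As Long) As String
    Dim nodes As Object

    Set nodes = doc.getElementsByClassName(classNames)

    ' Guard against pages that contain fewer matches than expected
    If index >= 0 And index < nodes.Length Then
        NthByClass = nodes.Item(index).innerHTML
    End If
End Function
```

It could be called as Debug.Print NthByClass(IE.document, "_59k _2rgt _1j-f _2rgt", 4). An empty result then signals that the page layout differs from the one in the screenshot, instead of raising a run-time error.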
Related
I've written a script in vba in combination with IE to click on some dots available on a map in a web page. When a dot is clicked, a small box containing relevant information pops up.
Link to that website
I would like to parse the content of each box. The content of a box can be found using the class name contentPane. However, the main concern here is to make each box appear by clicking on those dots. When a box shows up, it looks like the image below.
This is the script I've tried so far:
Sub HitDotOnAMap()
Const Url As String = "https://www.arcgis.com/apps/Embed/index.html?webmap=4712740e6d6747d18cffc6a5fa5988f8&extent=-141.1354,10.7295,-49.7292,57.6712&zoom=true&scale=true&search=true&searchextent=true&details=true&legend=true&active_panel=details&basemap_gallery=true&disable_scroll=true&theme=light"
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim post As Object, I&
With IE
.Visible = True
.navigate Url
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
Application.Wait Now + TimeValue("00:0:07") ''the following line zooms in the slider
HTML.querySelector("#mapDiv_zoom_slider .esriSimpleSliderIncrementButton").Click
Application.Wait Now + TimeValue("00:0:04")
With HTML.querySelectorAll("[id^='NWQMC_VM_directory_'] circle")
For I = 0 To .Length - 1
.item(I).Focus
.item(I).Click
Application.Wait Now + TimeValue("00:0:03")
Set post = HTML.querySelector(".contentPane")
Debug.Print post.innerText
HTML.querySelector("[class$='close']").Click
Next I
End With
End Sub
When I execute the above script, it appears to run smoothly, but nothing happens (I mean, no clicking) and it doesn't throw any error either. Finally it quits the browser gracefully.
This is what a box with information looks like when a dot gets clicked.
Although I've used hardcoded delays within my script, they can be fixed later, as soon as the macro starts working.
Question: How can I click each of the dots on that map and collect the relevant information from the popped-up box? I only expect a solution using Internet Explorer.
The data are not the main concern here. I would like to know how IE works in such cases so that I can deal with them in the future. A solution that does not use IE is not what I'm looking for.
There is no need to click on each dot. The JSON file has all the details, and you can extract whatever you require from it.
Installation of JsonConverter
Download the latest release
Import JsonConverter.bas into your project (Open VBA Editor, Alt + F11; File > Import File)
Add Dictionary reference/class
For Windows-only, include a reference to "Microsoft Scripting Runtime"
For Windows and Mac, include VBA-Dictionary
References to be added
Download the sample file here.
Code:
Sub HitDotOnAMap()
Const Url As String = "https://www.arcgis.com/sharing/rest/content/items/4712740e6d6747d18cffc6a5fa5988f8/data?f=json"
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim post As Object, I&
Dim data As String, colObj As Object
With IE
.Visible = True
.navigate Url
While .Busy = True Or .readyState < 4: DoEvents: Wend
data = .document.body.innerHTML
data = Replace(Replace(data, "<pre>", ""), "</pre>", "")
End With
Dim JSON As Object
Set JSON = JsonConverter.ParseJson(data)
Set colObj = JSON("operationalLayers")(1)("featureCollection")("layers")(1)("featureSet")
For Each Item In colObj("features")
For j = 1 To Item("attributes").Count - 1
Debug.Print Item("attributes").Keys()(j), Item("attributes").Items()(j)
Next
Next
End Sub
Output
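The same JSON could probably also be fetched without opening a browser window, which avoids stripping the <pre> tags. This is a rough sketch under several assumptions: the endpoint answers a plain GET without authentication, JsonConverter (VBA-JSON) is imported as described above, and every attribute value is a simple value rather than a nested object. ListFeatureAttributes is a hypothetical name.

```vba
Sub ListFeatureAttributes()
    Const Url As String = "https://www.arcgis.com/sharing/rest/content/items/4712740e6d6747d18cffc6a5fa5988f8/data?f=json"
    Dim json As Object, feature As Object, attrs As Object
    Dim key As Variant

    ' Synchronous request (third argument False), so the response is
    ' complete when .send returns
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", Url, False
        .send
        If .Status <> 200 Then Exit Sub
        Set json = JsonConverter.ParseJson(.responseText)
    End With

    ' Same path into the document as in the answer's code
    For Each feature In json("operationalLayers")(1)("featureCollection")("layers")(1)("featureSet")("features")
        Set attrs = feature("attributes")
        ' Iterating the keys directly avoids off-by-one index arithmetic
        For Each key In attrs.Keys
            Debug.Print key, attrs(key)
        Next key
    Next feature
End Sub
```

Note that this does not use Internet Explorer at all, so it only fits if that requirement can be relaxed.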
I've written a script in VBA using IE to parse some links from a webpage. The thing is, the links are within an iframe. I've tweaked my code so that the script first finds a link within that iframe, navigates to that new page, and parses the required content from there. If I do it this way, I can get all the links.
Webpage URL: weblink
Successful approach (working one):
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim elem As Object, post As Object
With IE
.Visible = True
.navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set elem = .document.getElementById("compInfo") ' this is the iframe element
.navigate elem.src
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
For Each post In HTML.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've seen a few sites where no such link exists within the iframe, so there I have no link to follow to track down the content.
If you take a look at the approach linked below, you can see that I've parsed content from a webpage where the content sits within an iframe. There is no link within that iframe to navigate to a new webpage to locate the content, so I used contentWindow.document instead and found it working flawlessly.
Link to the working code of parsing Iframe content from another site:
contentWindow approach
However, my question is: why should I navigate to a new webpage to collect the links when I can see the content on the landing page? I tried using contentWindow.document, but it gives me an access denied error. How can I make my code below work using contentWindow.document like I did above?
I tried like this but it throws access denied error:
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim frm As Object, post As Object
With IE
.Visible = True
.Navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
''the code breaks when it hits the following line "access denied error"
Set frm = HTML.getElementById("compInfo").contentWindow.document
For Each post In frm.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've attached an image to show which links (they are marked with pencil) I'm after.
These are the elements within which one such link (that I would like to grab) is found:
<div class="news">
<span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
<a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019039003&opt=9">ABB India Limited - Press Release</a>
</div>
Image of the links of that page I would like to grab:
From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?
You can see the links within the <iframe> in the browser, but you can't access them programmatically due to the same-origin policy.
Here is an example showing how to retrieve the links using XHR and RegEx:
Option Explicit
Sub Test()
Dim sContent As String
Dim sUrl As String
Dim aLinks() As String
Dim i As Long
' Retrieve initial webpage HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Extract target iframe URL via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Find the iframe whose src contains "Companydetails"
.Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
sUrl = .Execute(sContent).Item(0).SubMatches(0)
End With
' Retrieve iframe HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", sUrl, False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Parse links via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Process all anchors within div.news
.Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
With .Execute(sContent)
ReDim aLinks(0 To .Count - 1)
For i = 0 To .Count - 1
aLinks(i) = .Item(i).SubMatches(0)
Next
End With
End With
Debug.Print Join(aLinks, vbCrLf)
End Sub
As a disclaimer, regular expressions are generally not recommended for HTML parsing. The data being processed in this case is quite simple, which is why it is parsed with RegEx.
The output for me is as follows:
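For comparison, the second step could also be done with a DOM parser instead of a regular expression. This is only a sketch: NewsLinks is a hypothetical helper, it assumes the "htmlfile" COM object is available (Windows), and it walks the div elements by tag name because getElementsByClassName may not be supported in the document mode that htmlfile uses.

```vba
' Hypothetical helper: returns the href of the first anchor inside each
' <div class="news"> found in the given HTML text.
Public Function NewsLinks(ByVal htmlText As String) As Collection
    Dim doc As Object, div As Object, anchors As Object
    Dim result As New Collection

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = htmlText

    For Each div In doc.getElementsByTagName("div")
        If div.className = "news" Then
            Set anchors = div.getElementsByTagName("a")
            If anchors.Length > 0 Then
                ' Relative URLs may come back prefixed with "about:" because
                ' the in-memory document has no real base URL; strip the
                ' first occurrence of that prefix
                result.Add Replace(anchors.Item(0).href, "about:", "", 1, 1)
            End If
        End If
    Next div

    Set NewsLinks = result
End Function
```

Passing the iframe page's response text to NewsLinks and iterating the returned collection with For Each would then list the links, one per news block.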
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9
I also tried to copy the content of the <iframe> from IE to the clipboard (for pasting into the worksheet afterwards) using these commands:
IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
But those commands actually select and copy the main document, excluding the frame, unless I click on the frame manually. So this approach could work only if a click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).
Something like this should work. The key is to realize that the iFrame is technically another document. Looking at the iFrame on the page you listed, you can easily use a web request to get at the data you need. As already mentioned, the reason you get an error is the same-origin policy. You could write something to get the src of the iFrame and then do the web request as I've shown below, or use IE to scrape the page, get the src, and then load that page, which looks like what you have done.
I would recommend the web request approach; Internet Explorer can get annoying, fast.
Code
Public Sub SOExample()
Dim html As Object 'To store the HTML content
Dim Elements As Object 'To store the anchor collection
Dim Element As Object 'To iterate the anchor collection
Set html = CreateObject("htmlFile")
With CreateObject("MSXML2.XMLHTTP")
'Navigate to the source of the iFrame, it's another page
'View the source for the iframe. Alternatively -
'you could navigate to this page and use IE to scrape it
'Pass False so the request is synchronous and .Status is available right after .send
.Open "GET", "https://stocks.thehindubusinessline.com/Companydetails.aspx?&cocode=INE117A01022", False
.send ""
'See if the request was OK; exit if there was an error
If Not .Status = 200 Then Exit Sub
'Assign the page's HTML to an HTML object
html.body.InnerHTML = .responseText
Set Elements = html.body.document.getElementByID("hmstockchart_CompanyNews1_updateGLVV")
Set Elements = Elements.getElementsByTagName("a")
For Each Element In Elements
'Print out the data to the Immediate window
Debug.Print Element.InnerText
Next
End With
End Sub
Results
ABB India Limited - AGM/Book Closure
Board of ABB India recommends final dividend
ABB India to convene AGM
ABB India to pay dividend
ABB India Limited - Outcome of Board Meeting
More ?
The simplest solution, as everyone suggested, is to go directly to the link. This takes the IFRAME out of the picture and makes it easier for you to loop through the links. But in case you still don't like that approach, you need to go a bit deeper down the hole.
Below is a function from a library I wrote long ago in VB.NET:
https://github.com/tarunlalwani/ScreenCaptureAPI/blob/2646c627b4bb70e36fe2c6603acde4cee3354b39/Source%20Code/ScreenCaptureAPI/ScreenCaptureAPI/ScreenCapture.vb#L803
Private Function _EnumIEFramesDocument(ByVal wb As HTMLDocumentClass) As Collection
Dim pContainer As olelib.IOleContainer = Nothing
Dim pEnumerator As olelib.IEnumUnknown = Nothing
Dim pUnk As olelib.IUnknown = Nothing
Dim pBrowser As SHDocVW.IWebBrowser2 = Nothing
Dim pFramesDoc As Collection = New Collection
_EnumIEFramesDocument = Nothing
pContainer = wb
Dim i As Integer = 0
' Get an enumerator for the frames
If pContainer.EnumObjects(olelib.OLECONTF.OLECONTF_EMBEDDINGS, pEnumerator) = 0 Then
pContainer = Nothing
' Enumerate and refresh all the frames
Do While pEnumerator.Next(1, pUnk) = 0
On Error Resume Next
' Clear errors
Err.Clear()
' Get the IWebBrowser2 interface
pBrowser = pUnk
If Err.Number = 0 Then
pFramesDoc.Add(pBrowser.Document)
i = i + 1
End If
Loop
pEnumerator = Nothing
End If
_EnumIEFramesDocument = pFramesDoc
End Function
So basically this is a VB.NET version of the C++ code in the following question:
Accessing body (at least some data) in a iframe with IE plugin Browser Helper Object (BHO)
Now you just need to port it to VBA. The only problem you may have is finding the olelib reference. Most of the rest is VBA compatible.
Once you get the collection of documents, you find the one that belongs to your frame and then just use that one:
frames = _EnumIEFramesDocument(IE)
frames.Item(1).document.getElementsByTagName("A").length
I am a rookie in VBA excel.
There is a web application in which I need to click a button, the source of which is:
<em class="x-btn-arow" unselectable="on">
<button class="x-btn-text" id="ext-gen7576" style="" type="button">Actions</button>
Sub xx()
Dim IE As Object
Dim doc As HTMLDocument
Dim l As IHTMLElement
Dim lo As IHTMLElementCollection
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate "http://theapplicationlink"
Do
DoEvents
Loop Until IE.ReadyState = 4
Set doc = IE.Document
Set lo = doc.getElementsByTagName("button")
For Each l In lo
If l.getAttribute("class") = "x-btn-text" Then
l.click
End If
Next
End Sub
It doesn't throw any error, but it doesn't click the button.
I cannot use the ID, as it changes each time I launch the application.
Also, the class and type are the same for other buttons.
Forgive me for any technical errors.
Any help would be a huge favour.
There is an id. Does it change completely, or does part of it remain the same? If part of it stays the same, you could do a partial match on that part using a CSS selector.
That aside you could use:
objIE.document.querySelector("button[class*= x-btn-text]").Click
This uses the CSS selector button[class*= x-btn-text], which matches the first element with a button tag whose class attribute value contains x-btn-text.
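Since querySelector yields Nothing when no element matches, a small wrapper can avoid a run-time error on the .Click call. The following is a sketch only; ClickFirstMatch is a hypothetical helper name, and it assumes the page is rendered in a document mode that supports querySelector.

```vba
' Hypothetical helper: clicks the first element matching a CSS selector.
' Returns True when an element was found and clicked, False otherwise.
Public Function ClickFirstMatch(ByVal doc As Object, ByVal cssSelector As String) As Boolean
    Dim el As Object

    Set el = doc.querySelector(cssSelector)

    ' No match: leave the return value at its default of False
    If el Is Nothing Then Exit Function

    el.Click
    ClickFirstMatch = True
End Function
```

A call such as ClickFirstMatch(IE.document, "button[class*='x-btn-text']") would then report whether anything was clicked. If the stable part of the id turned out to be a prefix such as ext-gen (an assumption based on the single id shown in the question), a selector like button[id^='ext-gen'] could be tried in the same way.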
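<button> is a valid HTML tag, so getElementsByTagName("button") will find it. As an alternative to comparing the class attribute, you can loop over the elements of a tag and match on outerHTML. Here is an example; replace strTagName with the HTML tag that includes the element you want to click.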
Dim objTag As Object
For Each objTag In objIE.document.getElementsByTagName(strTagName)
If InStr(objTag.outerHTML, "x-btn-text") > 0 Then
objTag.Click
Exit For
End If
Next
This is my first post on Stack Overflow :) I've been Googling VBA knowledge and writing some VBA for about a month.
My computer info:
1.window 8.1
2.excel 2013
3.ie 11
My excel reference
Microsoft Object Library: yes
Microsoft Internet Controls: yes
Microsoft Form 2.0 Object library: yes
Microsoft Script Control 1.0: yes
Issue:
I was trying to retrieve data from internet explorer automatically using VBA.
I would like to retrieve the value of an input tag inside the element with id "u_0_1", which is under an element with id "facebook". I expect the value "AQFFmT0qn1TW" to be written to cell C2. However, after I run the VBA this message pops up: "run-time error '91': object variable or with block variable not set".
I have been trying this for a couple of weeks using different methods, such as:
1.getelementsbyClassname
2.getelementbyid
3.getelementsbyTagname
But it just doesn't work.
url:
http://coursesweb.net/javascript/getelementsbytagname
Below is my VBA code. Could you guys help me out a little bit please?
Private Sub CommandButton1_Click()
Dim ie As Object
Dim Doc As HTMLDocument
Dim getThis As String
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
ie.navigate "http://coursesweb.net/javascript/getelementsbytagname"
Do
DoEvents
Loop Until ie.readyState = 4
Set Doc = ie.document
getThis = Trim(Doc.getElementById("u_0_1")(0).getElementsByTagName("input")(0).Value)
Range("c2").Value = getThis
End Sub
Thanks for your help. I had no idea that there is a difference between JS and VBA with respect to the getElementsBy...() methods. I also find the loop method for finding the id very useful.
I still have some issues retrieving a value from a form or input element. I hope that you can help me or give me some suggestions for that as well.
Expected Result:
retrieve the value "AQFFmT0qn1TW" and copy it on Cell ("c2") automatically.
Actual Result:
nothing return to Cell ("C2")
Below is the HTML elements.
<form rel="async" ajaxify="/plugins/like/connect" method="post" action="/plugins/like/connect" onsubmit="return window.Event && Event.__inlineSubmit && Event.__inlineSubmit(this,event)" id="u_0_1">
<input type="hidden" name="fb_dtsg" value="AQFFmT0qn1TW" autocomplete="off">
Below is the VBA code based on your code.
Private Sub CommandButton1_Click()
Dim ie As Object
Dim Doc As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
ie.navigate "http://coursesweb.net/javascript/getelementsbytagname"
Do
DoEvents
Loop Until ie.readyState = 4
Set Doc = ie.document
Set Elements = Doc.getElementsByTagName("input")
For Each Element In Elements
If Element.name = "fb_dtsg" Then
Range("c2").Value = Element.innerText
End If
Next Element
Set Elements = Nothing
End Sub
Cheers.
First of all, I can't find the tags you were searching for in the source of the website. Anyway, I think you can't chain getElementById.getElementsByTagName as in JS. You have to loop through the collection of document elements.
Private Sub CommandButton1_Click()
Dim ie As Object
Dim Doc As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = 0
ie.navigate "http://coursesweb.net/javascript/getelementsbytagname"
Do
DoEvents
Loop Until ie.readyState = 4
Set Doc = ie.document
Set Elements = Doc.getElementsByTagName("ul")
For Each Element In Elements
If Element.ID = "ex4" Then
Sheets(1).Cells(1, 1).Value = Element.innerText
End If
Next Element
Set Elements = Nothing
End Sub
First I get the collection of "ul" tags, then loop through them looking for the id "ex4". In your case you'd get the collection of "input" elements and then loop for the id you want. Finding an id that is followed by a different id shouldn't be hard, just some If...Then statements.
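Applied to input elements, the loop could look like the sketch below. InputValueByName is a hypothetical helper name, and the name "fb_dtsg" is taken from the HTML quoted in the question. Note that it reads the Value property: an input element carries its text in Value and has no inner text.

```vba
' Hypothetical helper: returns the value of the first <input> whose name
' attribute equals inputName, or an empty string when none is found.
Public Function InputValueByName(ByVal doc As Object, ByVal inputName As String) As String
    Dim el As Object

    For Each el In doc.getElementsByTagName("input")
        If el.Name = inputName Then
            ' An input's text lives in .Value, not in .innerText
            InputValueByName = el.Value
            Exit Function
        End If
    Next el
End Function
```

Once the page has finished loading, Range("C2").Value = InputValueByName(ie.document, "fb_dtsg") would write the result to the worksheet, provided the element really exists in the loaded document.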
If you need further assistance, please respond with a URL where I can find exactly what you're looking for.
Cheers
I need to scrape the title, product description and product code from <<<HERE>>> and save them into a worksheet. In this case those are:
"Catherine Lansfield Helena Multi Bedspread - Double"
"This stunning ivory bedspread has been specially designed to sit with the Helena bedroom range. It features a subtle floral design with a diamond shaped quilted finish. The bedspread is padded so can be used as a lightweight quilt in the summer or as an extra layer in the winter.
Polyester.
Size L260, W240cm.
Suitable for a double bed.
Machine washable at 30°C.
Suitable for tumble drying.
EAN: 5055184924746.
Product Code 116/4196"
I have tried different methods, and none worked for me in the end. The Mid and InStr functions returned nothing, although it could be that my code was wrong. Sorry that I do not give any code; I had already messed it up many times without any result. I have also tried to scrape the whole page with GetDatafromPage. It works well, but for different product pages the output lands in different rows, because the number of elements changes from page to page. It is also not possible to scrape only chosen elements, so it is pointless to read values from fixed cells.
Another option, instead of using the InternetExplorer object, is the xmlhttp object. Here is an example similar to kekusemau's, but using the xmlhttp object to request the page. I then load the responseText from the xmlhttp object into an htmlfile document.
Sub test()
Dim xml As Object
Set xml = CreateObject("MSXML2.XMLHTTP")
xml.Open "Get", "http://www.argos.co.uk/static/Product/partNumber/1164196.htm", False
xml.send
Dim doc As Object
Set doc = CreateObject("htmlfile")
doc.body.innerhtml = xml.responsetext
Dim name
Set name = doc.getElementById("pdpProduct").getElementsByTagName("h1")(0)
MsgBox name.innerText
Dim desc
Set desc = doc.getElementById("genericESpot_pdp_proddesc2colleft").getElementsByTagName("div")(0)
MsgBox desc.innerText
Dim id
Set id = doc.getElementById("pdpProduct").getElementsByTagName("span")(0).getElementsByTagName("span")(2)
MsgBox id.innerText
End Sub
This does not seem too difficult. You can use Firefox to take a look at the page structure (right-click somewhere, click Inspect Element, and go on from there).
Here is a simple sample code:
Sub test()
Dim ie As InternetExplorer
Dim x
Set ie = New InternetExplorer
ie.Visible = True
ie.Navigate "http://www.argos.co.uk/static/Product/partNumber/1164196.htm"
While ie.ReadyState <> READYSTATE_COMPLETE
DoEvents
Wend
Set x = ie.Document.getElementById("pdpProduct").getElementsByTagName("h1")(0)
MsgBox Trim(x.innerText)
Set x = ie.Document.getElementById("genericESpot_pdp_proddesc2colleft").getElementsByTagName("div")(0)
MsgBox x.innerText
Set x = ie.Document.getElementById("pdpProduct").getElementsByTagName("span")(0).getElementsByTagName("span")(2)
MsgBox x.innerText
ie.Quit
End Sub
(I have a reference in Excel to Microsoft Internet Controls. I don't know whether it is there by default; if not, you have to set it before running this code.)