Can't find any logic how the faulty script works flawlessly? - vba

I've written a script in vba using IE to get the titles of different hotel names from a webpage. The hotel names traverse multiple pages through pagination.
My scraper keeps clicking the next button and parsing the titles from each page until there is nothing left to click. The parser does its job perfectly. All I want to understand is the logic I ask about below.
My question: how does the content of each page come through correctly even though I didn't use a Set Htmldoc = IE.document line right after the .click? When a click is initiated, the scraper goes to a new page with new content. How does Htmldoc get updated with the new content from each page when my Do loop comes after the With IE block?
This is the script:
Sub GetTitles()
    Const Url As String = "https://www.tripadvisor.com/Hotels-g147237-Caribbean-Hotels.html"
    Dim IE As New InternetExplorer, Htmldoc As HTMLDocument, post As Object, R&
    With IE
        .Visible = True
        .navigate Url
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set Htmldoc = .document
    End With
    Do
        For Each post In Htmldoc.getElementsByClassName("listing") ''how this "Htmldoc" gets updated
            With post.getElementsByClassName("property_title")
                If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
            End With
        Next post
        If Not Htmldoc.querySelector(".standard_pagination span[onclick*='pagination_next']") Is Nothing Then
            Htmldoc.querySelector(".standard_pagination span[onclick*='pagination_next']").Click
            Application.Wait Now + TimeValue("00:00:05")
            ''I didn't use anything like "Set Htmldoc = IE.document" but it still works flawlessly
        Else
            Exit Do
        End If
    Loop
    IE.Quit
End Sub

The script is not faulty. Using it without fully understanding it, though, can certainly cause trouble.
When you do this Set Htmldoc = .document you are storing a reference to IE's document for later use.
When you do this Htmldoc.querySelector(".standard_pagination span[onclick*='pagination_next']").Click JavaScript comes into play and updates the content of the page (i.e. the document).
You may believe that the document has changed, but it is only being updated. In reality, no navigation happens at all.
Add the following and see how the page/document remains the same while only the content changes.
'/ Url before Next button click
Debug.Print "Before Click " & Htmldoc.Url
Htmldoc.querySelector(".standard_pagination span[onclick*='pagination_next']").Click
'/ Url after Next button click
Debug.Print "After Click " & Htmldoc.Url
Since the document, once set, remains the same and the updated content has the same layout/DOM (most likely all the pages are rendered from one template), your code works perfectly fine. From the point of view of your Do loop, nothing changed.

Set Htmldoc = .document
gets a reference to the live DOM. When the page content changes, Htmldoc points at the new content, so there is no need to do a new Set Htmldoc.
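The reference behaviour described in these answers can be reproduced without a browser. The sketch below is only an analogy: a Collection stands in for IE's document, and the names CountSeenThroughReference, page and held are made up for illustration. Set copies a reference to the object, not its contents, so a later change to the object is visible through every variable that refers to it.

```vba
Option Explicit

' Analogy only: "page" plays the role of IE.document, "held" the role of Htmldoc.
Function CountSeenThroughReference() As Long
    Dim page As Collection
    Dim held As Collection

    Set page = New Collection
    page.Add "Hotel A"

    Set held = page          ' copies the reference, not the items

    page.Add "Hotel B"       ' the "page" is updated in place

    ' held was never re-assigned, yet it sees both items
    CountSeenThroughReference = held.Count
End Function

Sub Demo()
    Debug.Print CountSeenThroughReference()   ' prints 2
End Sub
```

This only covers in-place updates like the pagination click above. After a real navigation to a new URL, re-reading IE.document once the page has loaded is the cautious choice.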

Related

Can't click on some dots to scrape information

I've written a script in vba in combination with IE to click on some dots available on a map in a web page. When a dot is clicked, a small box containing relevant information pops up.
Link to that website
I would like to parse the content of each box. The content of that box can be found using the class name contentPane. However, the main concern here is to make each box appear by clicking on those dots. When a box shows up, it looks like the image below.
This is the script I've tried so far:
Sub HitDotOnAMap()
Const Url As String = "https://www.arcgis.com/apps/Embed/index.html?webmap=4712740e6d6747d18cffc6a5fa5988f8&extent=-141.1354,10.7295,-49.7292,57.6712&zoom=true&scale=true&search=true&searchextent=true&details=true&legend=true&active_panel=details&basemap_gallery=true&disable_scroll=true&theme=light"
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim post As Object, I&
With IE
.Visible = True
.navigate Url
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
Application.Wait Now + TimeValue("00:0:07") ''the following line zooms in the slider
HTML.querySelector("#mapDiv_zoom_slider .esriSimpleSliderIncrementButton").Click
Application.Wait Now + TimeValue("00:0:04")
With HTML.querySelectorAll("[id^='NWQMC_VM_directory_'] circle")
For I = 0 To .Length - 1
.item(I).Focus
.item(I).Click
Application.Wait Now + TimeValue("00:0:03")
Set post = HTML.querySelector(".contentPane")
Debug.Print post.innerText
HTML.querySelector("[class$='close']").Click
Next I
End With
End Sub
When I execute the above script, it appears to run smoothly, but nothing happens (I mean, no clicking), and it doesn't throw any error either. Finally, it quits the browser gracefully.
This is how a box with information looks like when a dot gets clicked.
Although I've used hardcoded delays in my script, they can be fixed once the macro starts working.
Question: How can I click each of the dots on that map and collect the relevant information from the popped-up box? I only expect a solution using Internet Explorer.
The data are not the main concern here. I would like to know how IE works in such cases so that I can deal with them in the future. I'm not looking for any solution other than IE.
There is no need to click each dot. A JSON file has all the details, and you can extract what you need from it.
Installation of JsonConverter
Download the latest release
Import JsonConverter.bas into your project (Open VBA Editor, Alt + F11; File > Import File)
Add Dictionary reference/class
For Windows-only, include a reference to "Microsoft Scripting Runtime"
For Windows and Mac, include VBA-Dictionary
References to be added
Download the sample file here.
Code:
Sub HitDotOnAMap()
Const Url As String = "https://www.arcgis.com/sharing/rest/content/items/4712740e6d6747d18cffc6a5fa5988f8/data?f=json"
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim post As Object, I&
Dim data As String, colObj As Object
With IE
.Visible = True
.navigate Url
While .Busy = True Or .readyState < 4: DoEvents: Wend
data = .document.body.innerHTML
data = Replace(Replace(data, "<pre>", ""), "</pre>", "")
End With
Dim JSON As Object
Set JSON = JsonConverter.ParseJson(data)
Set colObj = JSON("operationalLayers")(1)("featureCollection")("layers")(1)("featureSet")
For Each Item In colObj("features")
For j = 1 To Item("attributes").Count - 1
Debug.Print Item("attributes").Keys()(j), Item("attributes").Items()(j)
Next
Next
End Sub

Unable to parse some links lying within an iframe

I've written a script in VBA using IE to parse some links from a webpage. The links are inside an iframe. I've tweaked my code so that the script first finds the link within that iframe, navigates to that new page, and parses the required content from there. If I do it this way, I can get all the links.
Webpage URL: weblink
Successful approach (working one):
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim elem As Object, post As Object
With IE
.Visible = True
.navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set elem = .document.getElementById("compInfo") 'it is within iframe
.navigate elem.src
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
For Each post In HTML.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've seen a few sites where no such link exists within the iframe, so there I have no link to follow to reach the content.
If you look at the approach below, you can see that I've parsed content that sits inside an iframe on another site. That iframe has no link to navigate to in order to locate the content, so I used contentWindow.document instead and found it working flawlessly.
Link to the working code of parsing Iframe content from another site:
contentWindow approach
However, my question is: why should I navigate to a new webpage to collect the links when I can see the content on the landing page? I tried using contentWindow.document, but it gives me an access denied error. How can I make my code below work using contentWindow.document like I did above?
I tried like this but it throws access denied error:
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim frm As Object, post As Object
With IE
.Visible = True
.Navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
''the code breaks when it hits the following line "access denied error"
Set frm = HTML.getElementById("compInfo").contentWindow.document
For Each post In frm.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've attached an image to let you know which links (they are marked with pencil) I'm after.
These are the elements within which one such link (i would like to grab) is found:
<div class="news">
<span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
<a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019039003&opt=9">ABB India Limited - Press Release</a>
</div>
Image of the links of that page I would like to grab:
From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?
You can see the links within <iframe> in browser but can't access them programmatically due to Same-origin policy.
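To make the same-origin rule concrete, here is a minimal sketch that compares the origin (scheme, host and port) of the landing page with that of the iframe URL mentioned in the question. UrlOrigin and IsSameOrigin are hypothetical helpers written for this illustration, not part of any library, and the parsing is deliberately simplified: default ports are not normalised, and a URL whose query string starts before the first slash is not handled.

```vba
Option Explicit

' Returns scheme://host[:port] of an absolute URL, lower-cased.
' Simplified on purpose: default ports (80/443) are not normalised.
Function UrlOrigin(ByVal url As String) As String
    Dim schemeEnd As Long, pathStart As Long
    schemeEnd = InStr(1, url, "://")
    If schemeEnd = 0 Then Exit Function            ' not an absolute URL
    pathStart = InStr(schemeEnd + 3, url, "/")     ' first slash after the host
    If pathStart = 0 Then pathStart = Len(url) + 1
    UrlOrigin = LCase$(Left$(url, pathStart - 1))
End Function

' True only when both URLs are absolute and share scheme, host and port.
Function IsSameOrigin(ByVal urlA As String, ByVal urlB As String) As Boolean
    Dim originA As String
    originA = UrlOrigin(urlA)
    IsSameOrigin = (Len(originA) > 0) And (originA = UrlOrigin(urlB))
End Function

Sub Demo()
    ' Landing page versus the iframe source: different hosts, so different origins
    Debug.Print IsSameOrigin( _
        "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", _
        "http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022")
End Sub
```

When the two origins differ, as they do here, the browser refuses script access from the outer document to the frame's document, which is what surfaces in VBA as the access denied error.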
Here is an example showing how to retrieve the links using XHR and RegEx:
Option Explicit
Sub Test()
Dim sContent As String
Dim sUrl As String
Dim aLinks() As String
Dim i As Long
' Retrieve initial webpage HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Extract target iframe URL via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Match the iframe whose src contains "Companydetails"
.Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
sUrl = .Execute(sContent).Item(i).SubMatches(0)
End With
' Retrieve iframe HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", sUrl, False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Parse links via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Process all anchors within div.news
.Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
With .Execute(sContent)
ReDim aLinks(0 To .Count - 1)
For i = 0 To .Count - 1
aLinks(i) = .Item(i).SubMatches(0)
Next
End With
End With
Debug.Print Join(aLinks, vbCrLf)
End Sub
RegEx is generally not recommended for HTML parsing, so take this as a disclaimer. The data being processed in this case is quite simple, which is why it is parsed with RegEx.
The output for me is as follows:
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9
I also tried to copy the content of the <iframe> from IE to clipboard (for further pasting to the worksheet) using commands:
IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
But those commands actually select and copy the main document, excluding the frame, unless I click on the frame manually. So that approach might work if a click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).
Something like this should work. The key is to realize that the iFrame is technically another document. Reviewing the iFrame on the page you listed, you can easily use a web request to get at the data you need. As already mentioned, the reason you get an error is the same-origin policy. You could write something to get the src of the iFrame and then do the web request as I've shown below, or use IE to scrape the page, get the src, and then load that page, which looks like what you have done.
I would recommend the web request approach; Internet Explorer can get annoying, fast.
Code
Public Sub SOExample()
Dim html As Object 'To store the HTML content
Dim Elements As Object 'To store the anchor collection
Dim Element As Object 'To iterate the anchor collection
Set html = CreateObject("htmlFile")
With CreateObject("MSXML2.XMLHTTP")
'Navigate to the source of the iFrame, it's another page
'View the source for the iframe. Alternatively -
'you could navigate to this page and use IE to scrape it
.Open "GET", "https://stocks.thehindubusinessline.com/Companydetails.aspx?&cocode=INE117A01022", False 'False = synchronous, so .Status is ready after .send
.send ""
'See if the request was ok, exit if there was an error
If Not .Status = 200 Then Exit Sub
'Assign the page's HTML to an HTML object
html.body.InnerHTML = .responseText
Set Elements = html.body.document.getElementByID("hmstockchart_CompanyNews1_updateGLVV")
Set Elements = Elements.getElementsByTagName("a")
For Each Element In Elements
'Print out the data to the Immediate window
Debug.Print Element.InnerText
Next
End With
End Sub
Results
ABB India Limited - AGM/Book Closure
Board of ABB India recommends final dividend
ABB India to convene AGM
ABB India to pay dividend
ABB India Limited - Outcome of Board Meeting
More ?
The simplest solution, as everyone suggested, is to go directly to the link. This takes the IFRAME out of the picture and makes it easier to loop through the links. But if you still don't like that approach, you need to dig a bit deeper.
Below is a function from a library I wrote long back in VB.NET
https://github.com/tarunlalwani/ScreenCaptureAPI/blob/2646c627b4bb70e36fe2c6603acde4cee3354b39/Source%20Code/ScreenCaptureAPI/ScreenCaptureAPI/ScreenCapture.vb#L803
Private Function _EnumIEFramesDocument(ByVal wb As HTMLDocumentClass) As Collection
Dim pContainer As olelib.IOleContainer = Nothing
Dim pEnumerator As olelib.IEnumUnknown = Nothing
Dim pUnk As olelib.IUnknown = Nothing
Dim pBrowser As SHDocVW.IWebBrowser2 = Nothing
Dim pFramesDoc As Collection = New Collection
_EnumIEFramesDocument = Nothing
pContainer = wb
Dim i As Integer = 0
' Get an enumerator for the frames
If pContainer.EnumObjects(olelib.OLECONTF.OLECONTF_EMBEDDINGS, pEnumerator) = 0 Then
pContainer = Nothing
' Enumerate and refresh all the frames
Do While pEnumerator.Next(1, pUnk) = 0
On Error Resume Next
' Clear errors
Err.Clear()
' Get the IWebBrowser2 interface
pBrowser = pUnk
If Err.Number = 0 Then
pFramesDoc.Add(pBrowser.Document)
i = i + 1
End If
Loop
pEnumerator = Nothing
End If
_EnumIEFramesDocument = pFramesDoc
End Function
So basically this is a VB.NET version of the C++ version below
Accessing body (at least some data) in a iframe with IE plugin Browser Helper Object (BHO)
Now you just need to port it to VBA. The only problem you may have is finding the olelib reference. Most of the rest is VBA compatible.
Once you get the collection of documents, find the one that belongs to your frame and then use just that one:
frames = _EnumIEFramesDocument(IE)
frames.Item(1).document.getElementsByTagName("A").length

VBA: New (or Redefined?) Internet Explorer Object In Same Window

I'm creating a macro that will navigate to a login page, log in, navigate to another page and scrape data, and then loop through 100-200 more pages scraping data from each.
So far I've gotten it to the point of logging in, navigating to the second page, and scraping the first bit of data. But so far the only way I can get it to work is if the second page opens in a new window. Since I ultimately have to go through 100-200 pages, I'd rather not use a new window for each one.
For this example let's just say that the only data I'm trying to scrape is the page title.
Option Explicit
Sub admin_scraper()
Dim ie As Object
Dim doc As Object
' Get through log in page
Set ie = CreateObject("internetexplorer.application")
With ie
.navigate "http://example.com/login" 'Page title is "Page 1"
.Visible = True
End With
While ie.Busy Or ie.readyState <> 4
DoEvents
Wend
ie.document.forms(0).all("Username").Value = "user"
ie.document.forms(0).all("Password").Value = "abc123"
ie.document.forms(0).submit
'Navigate to second page and pull page title
Set ie = CreateObject("internetexplorer.application") '***Line in question
With ie
.navigate "http://example.com/Products" 'Page title is "Page 2"
.Visible = True
End With
While ie.Busy Or ie.readyState <> 4
DoEvents
Wend
Set doc = ie.document
Debug.Print doc.Title
End Sub
*** If I include this line the code works as expected (console prints "Page 2"), but it opens the second page in a new window. If I don't include this line, the second page opens smoothly in the same window, but the console prints "Page 1."
Any way I can get it to open each new page in the same window while making sure it pulls data from the new page? Or if it has to be in a new window, any way to automatically close the old window each time?

Click button of java build website using VBA code

I'm in the process of learning web scraping using VBA, and I've come across the website listed in my code, which is built with Java.
My goal is to click the Next button from VBA code, but I cannot identify it when looking at the "behind the scenes" code. When using "Inspect" element I see the button reference, but when I list all the links in my Excel sheet it is not there.
Then I looked at the page source using Ctrl+U (in Chrome), and the page looks completely different and has a lot of Java code. I am not familiar with Java, so could you please describe the mechanism for decoding such a Java page?
Here is my code:
Sub first_template()
Dim ie As New InternetExplorer
Dim doc As HTMLDocument
ie.Top = 0
ie.Left = 0
ie.Visible = True
ie.navigate "http://www.amaaonline.com/member-directory/"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set input_elements = ie.document.getElementsByTagName("a")
i = 4
For Each element In input_elements
Cells(i, 1).Value = element.innerHTML
i = i + 1
Next element
ie.Quit
Set ie = Nothing
End Sub

How to automate a dynamically changing web page using Excel VBA?

I have been trying to automate a web page for two weeks, but I cannot get any further than the 3rd page.
First I log in on the login page with my credentials, and then I click a link on the 2nd page. Up to this point everything is fine. After that I need to click another link on the 3rd page, which I am not able to do. I cannot even read the proper innerHTML of that page: the innerHTML differs from the source code of the page. I took the id/name from the source code to get the element, but to no avail. The problem I see is that the document object does not pick up the inner details of the 3rd page. When I tried to print the links of that page, it printed some common links that are available on all the pages instead of all the links on that particular page. I guess this might happen because the page frame varies with the FromDate and ToDate. Pardon me if I'm wrong. Do we need to re-read the "ie.document" object every time the web page navigates? I think it sticks with the document from when the page loaded the first time.
Below is my code:
Public Sub Test ()
Dim shellWins As ShellWindows
Dim ie As InternetExplorer
Dim doc As HTMLDocument
Dim frm As HTMLFrameElement
Dim frms As HTMLElementCollection
Dim strSQL As String
Dim Login As Boolean
strSQL = "https://website.com"
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.Navigate strSQL
Do Until .ReadyState = 4: DoEvents: Loop
Set doc = ie.document
Dim link As Object
For Each link In doc.Links
'Debug.Print link.innerText
If link.innerText = "Click Here" Then
link.Click
Exit For
End If
Next link
Do While ie.Busy: DoEvents: Loop
Login_Pane:
For Each link In doc.Links
If link.innerText = "Leave & Attendance" Then
'Debug.Print doc.body.innerHTML
link.Click
Login = True
Exit For
End If
Next link
If Login <> True Then
Application.Wait (Now + TimeValue("00:00:02"))
Application.SendKeys "<USERNAME>", True
Application.SendKeys "{TAB}"
Application.Wait (Now + TimeValue("00:00:02"))
Application.SendKeys "<PASSWORD>", True
Application.SendKeys "{ENTER}"
GoTo Login_Pane
End If
Do While ie.Busy: DoEvents: Loop
For Each link In doc.Links
Debug.Print link.innerText
' The line above should print all the links on that page,
' but unfortunately it does not print them as they appear in the source code.
' Instead it prints only the subset of links that is common to all pages.
' This page has three frames
Next link
End With
'IE.Quit
End Sub
I'm unable to post an image of that page to help you understand, but I'll try my best to describe it.
When I use the code below, I can only get the links from the upper portion of the page.
Set doc = ie.document
Dim text As Object
For Each text In doc.Links
Debug.Print text.innerText
Next text
Below that portion of the page I have the option to enter a FromDate and a ToDate. By entering dates in these text boxes I can see the details for those dates (by default the page displays the details from the 1st of the current month to the current date).
This is where I am not getting the links or other details. I think the details of this section are not stored in the ie.document object.
This particular section also has a different URL from the main page.
Thanks.
A couple of thoughts:
For a page that loads dynamically, use Application.Wait (5 seconds or so) instead of Do Until .ReadyState = 4: DoEvents: Loop. The latter does not help when JavaScript is still executing after the page has loaded.
SendKeys should be avoided wherever possible, as it is not robust. Inspect the element with a DOM explorer to get its ID or name.