Excel with VBA - using XmlHttp to navigate div elements

I am using Excel with VBA to open a page, extract some information, and put it in my database. After some research, I figured out that opening IE obviously takes more time and that the same thing can be achieved using XmlHTTP. I am using XmlHTTP to open a web page, as proposed in another question of mine. However, while using IE I was able to navigate through div tags. How can I accomplish the same with XmlHTTP?
If I use IE to open the page, I am doing something like below to navigate through multiple div elements.
Set openedpage1 = iedoc1.getElementById("profile-experience").getElementsByClassName("title")
For Each div In openedpage1
---------
However, with XmlHttp, the equivalent below does not work:
For Each div In html.getElementById("profile-experience").getElementsByClassName("title")
I get the error "Object doesn't support this property or method".

Take a look at this answer that I posted for another question, as it is close to what you're looking for. In summary, you will:
Create a Microsoft.xmlHTTP object
Use the xmlHTTP object to open your url
Load the response as XML into a DOMDocument object
From there you can get a set of XMLNodes and select elements, attributes, etc. from the DOMDocument, as in the sketch below
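A minimal sketch of those steps, assuming the response is well-formed XML (an arbitrary HTML page usually is not) and using placeholder URL and XPath values:
Public Sub FetchAsXml()
    Dim xhr As Object, doc As Object, node As Object

    'Steps 1 and 2: create the XMLHTTP object and open the URL (placeholder)
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    xhr.Open "GET", "http://www.example.com/data.xml", False
    xhr.send

    'Step 3: load the response into a DOMDocument
    Set doc = CreateObject("MSXML2.DOMDocument")
    If doc.loadXML(xhr.responseText) Then
        'Step 4: select nodes with XPath ("//title" is a placeholder)
        For Each node In doc.selectNodes("//title")
            Debug.Print node.Text
        Next node
    Else
        Debug.Print "Response was not valid XML: " & doc.parseError.reason
    End If
End Sub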

The XMLHttp object returns the contents of the page as a string in responseText. You will need to parse this string to find the information you need. Regex is an option but it will be quite cumbersome.
This page uses string functions (Mid, InStr) to extract information from the HTML text.
It may be possible to create a DOMDocument from the retrieved HTML (I believe it is), but I haven't pursued this.
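For illustration, a hedged sketch of that string-function approach; the marker strings are placeholders and depend on the HTML you actually receive:
'Extract the text between two markers with InStr/Mid; returns "" if a marker is missing
Function ExtractBetween(sHtml As String, sFrom As String, sTo As String) As String
    Dim lStart As Long, lEnd As Long
    lStart = InStr(1, sHtml, sFrom, vbTextCompare)
    If lStart = 0 Then Exit Function
    lStart = lStart + Len(sFrom)
    lEnd = InStr(lStart, sHtml, sTo, vbTextCompare)
    If lEnd = 0 Then Exit Function
    ExtractBetween = Mid$(sHtml, lStart, lEnd - lStart)
End Function

'e.g. ExtractBetween(oXHTTP.responseText, "<title>", "</title>")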

As mentioned in the answers above, put the .responseText into an HTMLDocument and then work with that object, e.g.
Option Explicit

'Requires a reference to "Microsoft HTML Object Library" for HTMLDocument
Public Sub test()
    Dim html As HTMLDocument
    Set html = New HTMLDocument

    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim aNodeList As Object, iItem As Long
    Set aNodeList = html.querySelectorAll("#profile-experience .title")

    With ActiveSheet
        For iItem = 0 To aNodeList.Length - 1
            .Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
            '.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
        Next iItem
    End With
End Sub
Note:
I have literally translated your getElementById("profile-experience").getElementsByClassName("title") into a CSS selector, querySelectorAll("#profile-experience .title"). Note the space: it makes this a descendant selector (elements with class title inside the element with id profile-experience), which matches the chained calls above. Without the space the selector would instead require a single element carrying both the id and the class. I assume the original chain was correct in the first place.

Related

Import web page source code, including content not displayed on the page

I want to import into Excel the web page source code that I see using the View Page Source option in Chrome. But when I import it using the code below, it doesn't import all the content; the values that I'm looking for do not appear in the imported source.
I'm also unable to locate the element using getElementsByClassName or other methods.
Private Sub HTML_VBA_Excel()
    Dim oXMLHTTP As Object
    Dim sPageHTML As String
    Dim sURL As String

    'Change the URL before executing the code
    sURL = "http://pntaconline/getPrDetails?entry=8923060"

    'Extract data from the website into Excel using VBA
    Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    oXMLHTTP.Open "GET", sURL, False
    oXMLHTTP.send
    sPageHTML = oXMLHTTP.responseText

    'Get the webpage data into Excel.
    'If the source code is long, save it to an external text file instead,
    'since a single Excel cell can only hold a limited number of characters.
    ThisWorkbook.Sheets(1).Cells(1, 1) = sPageHTML
    MsgBox "XMLHTTP Fetch Completed"
End Sub
The data I want to import is the IDs and names.
You need to understand the DOM in order to realize why this isn't loading everything.
XMLHTTP is going to load only the specific resource you requested. A lot of web pages, in fact pretty much all web pages, load extra resources after the initial request is done.
If you're missing stuff, it's probably loaded by a different network request. So open up your DevTools in Chrome, make sure the Network tab is recording, and watch how many network requests go in and out when you load your target page.
Essentially, if you're using XMLHTTP, you'd have to simulate each of those requests to get the data you want to scrape.
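For example, once you have identified the request that actually carries the data, you can issue it directly; a hedged sketch, where the endpoint path and Accept header are hypothetical and would be copied from the Network tab:
Dim oReq As Object
Set oReq = CreateObject("MSXML2.ServerXMLHTTP")
'hypothetical XHR endpoint - copy the real URL from the DevTools Network tab
oReq.Open "GET", "http://pntaconline/api/prDetails?entry=8923060", False
oReq.setRequestHeader "Accept", "application/json"
oReq.send
Debug.Print oReq.responseText 'often JSON rather than HTML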
EDIT
So you're just kind of pasting the raw data response into Excel.
It's better to create an HTMLDocument variable and load the XMLHTTP response into it, along the lines of the MSXML documentation here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762275(v=vs.85)
Dim xmlhttp As Object
Set xmlhttp = CreateObject("Msxml2.XMLHTTP.3.0")
xmlhttp.Open "GET", "http://localhost/books.xml", False
xmlhttp.send
Debug.Print xmlhttp.responseText

Dim xString As String
xString = xmlhttp.responseText
'search the xString variable
You can then split that response for the sheet or search it and extract the values in VBA memory, rather than print to the sheet.
You could also load the xString response text as the innerHTML of a new HTMLDocument variable:
Dim xHTML As HTMLDocument
Set xHTML = New HTMLDocument
xHTML.body.innerHTML = xString
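You can then query the parsed document with the usual DOM methods, for example:
Debug.Print xHTML.getElementsByTagName("div").Length
'or, with a hypothetical element id:
'Debug.Print xHTML.getElementById("some-id").innerText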

Trouble scraping the names from a certain website

I've come across a webpage which seems to me a bit misleading
to scrape. When I go to the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1" it takes me to a page with a "suchen" option. After clicking "suchen" it opens a new layout within the same tab and takes me to a page with lots of names; the site address is the same again, "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the names on that page, such as "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I wrote so far:
Sub WebData()
    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument, source As Object, item As Object
    Dim x As Long

    With http
        .Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
        .send
        html.body.innerHTML = .responseText
    End With

    Set source = html.getElementsByClassName("ng-binding ng-scope")
    For Each item In source
        x = x + 1
        Cells(x, 1) = item.innerText
    Next item

    Set html = Nothing: Set source = Nothing
End Sub
The links are incremented like the ones below, as per the XHR requests in the developer tools, but I can't figure out the number of the last link (see the sketch after the links).
"https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
"https://jobboerse2.arbeitsagentur.d...00&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=24"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=36"

Using MSXML to fetch data from website

I am trying to use the following code to geocode a bunch of cities from this website: mygeoposition.com, but there seems to be some sort of issue, and the variable 'Lat' always comes back empty:
Sub Code()
    Dim IE As MSXML2.XMLHTTP60
    Set IE = New MSXML2.XMLHTTP60

    IE.Open "GET", "http://mygeoposition.com/?q=Chuo-ku, Osaka", False
    IE.send
    While IE.readyState <> 4 'redundant for a synchronous request, but harmless
        DoEvents
    Wend

    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim htmlBody As MSHTML.htmlBody
    Set HTMLDoc = New MSHTML.HTMLDocument
    Set htmlBody = HTMLDoc.body
    htmlBody.innerHTML = IE.responseText

    Dim Lat As String
    Lat = HTMLDoc.getElementById("geodata-lat").innerHTML
    IE.abort
End Sub
I have another piece of code that uses the browser to do the same thing, and it works fine, but it gets quite slow. When I use this code with MSXML, it just doesn't work. Apologies, I am a newbie at using VBA to extract data from websites. Please help.
The response contains no content in the geodata-lat element. It appears that client-side code is filling in that data, so your response only contains the HTML that the server generated. I tried this out myself, and the section of the response you are looking for is empty.
If you try an element that does have content (geodata-kml-button), it does pull in a value ("Download KML file"); ignore the ByteArrayToString() call in my test, that was just me experimenting.
If they don't have a real API then I don't think you can get your data this way.

MSXML2.XMLHTTP page request: How do you make sure you get ALL of the final HTML code?

I've used this simple subroutine for loading HTML documents from the web for some time now with no problems:
Function GetSource(sURL As String) As Variant
    ' Purpose:  To obtain the HTML text of a web page
    ' Receives: The URL of the web page
    ' Returns:  The HTML text of the web page in a variant
    Dim oXHTTP As Object
    Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
    oXHTTP.Open "GET", sURL, False
    oXHTTP.send
    GetSource = oXHTTP.responseText
    Set oXHTTP = Nothing
End Function
but I've run into a situation where it only loads part of a page most of the time (not always -- sometimes it loads all of the expected HTML code). If you SAVE the HTML of the page to another file on the web from a browser, the subroutine will always read it with no problem.
I'm guessing that the issue is timing -- that the dynamic page registers "done" while a script is still filling in details. Sometimes it completes in time, other times it doesn't.
Has anyone ever encountered this behavior and surmounted it? It seems that there should be a way of capturing, via the MSXML2.XMLHTTP object, exactly what you'd get if you went to the page and chose the save-to-HTML option.
If you'd like to see the behavior for yourself, here's a sample of a page that doesn't load consistently:
http://www.tiff.net/festivals/thefestival/programmes/specialpresentations/mr-turner
and here's a saved HTML file of that same page:
http://tofilmfest.ca/2014/film/fest/Mr_Turner.htm
Is there any known workaround for this?
I found a workaround that gives me what I want. I control Internet Explorer programmatically and invoke a three-second delay after telling it to navigate to a page, to give the content time to finish loading. Then I extract the HTML code using an IHTMLElement from Microsoft's HTML library. It's not pretty, but it retrieves all of the HTML code for every page I've tried it with. If anybody has a better way of accomplishing the same end, feel free to show off.
Function testbrowser() As Variant
    Dim oIE As InternetExplorer
    Dim hElm As IHTMLElement
    Set oIE = New InternetExplorer

    oIE.Height = 600
    oIE.Width = 800
    oIE.Visible = True
    oIE.Navigate "http://www.tiff.net/festivals/thefestival/programmes/galapresentations/the-riot-club"

    Call delay(3)
    Set hElm = oIE.Document.all.tags("html").Item(0)
    testbrowser = hElm.outerHTML
End Function

Sub delay(ByVal secs As Integer)
    Dim datLimit As Date
    datLimit = DateAdd("s", secs, Now())
    While Now() < datLimit
        DoEvents 'yield, so Excel stays responsive while waiting
    Wend
End Sub
Following Alex's suggestion, here's how to do it without a brute-force fixed delay:
Function GetHTML(ByVal strURL As String) As Variant
    Dim oIE As InternetExplorer
    Dim hElm As IHTMLElement
    Set oIE = New InternetExplorer

    oIE.Navigate strURL
    Do While (oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE)
        DoEvents
    Loop

    Set hElm = oIE.Document.all.tags("html").Item(0)
    GetHTML = hElm.outerHTML

    oIE.Quit 'close the browser instance; otherwise each call leaks an IE process
    Set oIE = Nothing
    Set hElm = Nothing
End Function

Crawl to final URL from within Excel VBA

I have a list of domain names, and many of them redirect to the same domain. For instance, foo1.com, foo2.com and foo3.com all deposit me at foo.com.
I'm trying to deduplicate the list of domains by writing a VBA script that loads the final page and extracts its URL.
I started from this article, which retrieves the page's title (http://www.excelforum.com/excel-programming-vba-macros/355192-can-i-import-raw-html-source-code-into-excel.html), but I can't figure out how to modify it to get the final URL (from which I can extract the domain).
Can anyone please point me in the right direction?
Add a reference to "Microsoft XML, v3.0" (or whatever version you have, adjusting the ServerXMLHTTP class name below to match):
Sub tester()
    Debug.Print CheckRedirect("adhpn2.com")
End Sub

Function CheckRedirect(URL As String) As String
    If Not UCase(URL) Like "HTTP://*" Then URL = "http://" & URL
    With New MSXML2.ServerXMLHTTP40
        .Open "HEAD", URL, False
        .send
        CheckRedirect = .getOption(-1) '-1 = SXH_OPTION_URL: the URL after any redirects
    End With
End Function
Try this; you need to look at .LocationURL:
Public Function gsGetFinalURL(rsURL As String) As String
    Dim ie As Object
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .navigate rsURL
        Do While .Busy Or .ReadyState <> 4 'wait until navigation has fully completed
            DoEvents
        Loop
        gsGetFinalURL = .LocationURL
        .Quit
    End With
    Set ie = Nothing
End Function
I haven't tried it on a huge variety of URLs, just the one you gave and a couple of others. If the URL is invalid, it will return whatever was passed in. You can use the code from the original function to check for that and handle it accordingly.
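For the deduplication itself, a hedged sketch that keys a Scripting.Dictionary on the final URL, reusing CheckRedirect from the first answer and assuming a contiguous list of domains in column A of the active sheet:
Sub DedupeByFinalURL()
    Dim dict As Object, rngCell As Range, sFinal As String
    Set dict = CreateObject("Scripting.Dictionary")
    For Each rngCell In ActiveSheet.Range("A1", ActiveSheet.Range("A1").End(xlDown))
        sFinal = CheckRedirect(CStr(rngCell.Value))
        If Not dict.Exists(sFinal) Then dict.Add sFinal, rngCell.Value
    Next rngCell
    'dict.Keys now holds one final URL per group of aliases
    Debug.Print Join(dict.Keys, vbNewLine)
End Sub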