Generate Webpage Titles from List of URLs in Excel - vba

I have a column in Excel of website URLs and would like to automatically retrieve the webpage titles for each of those URLs and place them in a second corresponding column.
I don't have any experience with VBA, but would like to use VBA to do this. Can someone please list the steps one-by-one to get this done? What code do I paste and where? How do I point the code to the right column to pick up the URLs and how do I tell it where to populate the results? How do I run the code?
I think this is a common problem, but many people don't use the existing documentation on how to solve it because they have the same issue as me: they don't know how to load and run the script.
Any help would be greatly appreciated! Please be as detailed as possible.

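Adjust the range "A1:A10" to match your data. Each title is written to the cell immediately to the right of its URL.
Sub GetTitles()
    Dim c As Range, url As String
    ' Loop over every cell that holds a URL
    For Each c In Range("A1:A10").Cells
        url = Trim(c.Value)
        ' Accept both http:// and https:// addresses
        If LCase(url) Like "http://*" Or LCase(url) Like "https://*" Then
            ' Write the title one column to the right of the URL
            c.Offset(0, 1).Value = GetTitle(url)
        End If
    Next c
End Sub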
Function GetTitle(sURL As String)
    Dim title As String, res As String, pos1, pos2
    Dim objHttp As Object
    ' Fetch the page synchronously
    Set objHttp = CreateObject("MSXML2.ServerXMLHTTP")
    objHttp.Open "GET", sURL, False
    objHttp.Send ""
    res = objHttp.ResponseText
    ' Locate the opening and closing title tags (case-insensitive)
    pos1 = InStr(1, UCase(res), "<TITLE>")
    pos2 = InStr(1, UCase(res), "</TITLE>")
    title = "<not found>"
    If pos1 > 0 And pos2 > 0 Then
        pos1 = pos1 + Len("<TITLE>")
        title = Mid(res, pos1, pos2 - pos1)
    End If
    GetTitle = title
End Function
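The question also asks where the code goes and how to point it at the right column. What follows is a minimal sketch, not a definitive implementation. It assumes the URLs are in column A of the active sheet starting at row 1, and that the GetTitle function above lives in the same module. The name GetTitlesForColumn and the constant URL_COLUMN are hypothetical. To use it, press Alt+F11 in Excel to open the VBA editor, choose Insert > Module, paste the code together with GetTitle, then run it with F5 (or from Excel with Alt+F8).

```vba
Sub GetTitlesForColumn()
    ' Assumption: URLs are in column A (column number 1) of the active sheet.
    Const URL_COLUMN As Long = 1
    Dim lastRow As Long, r As Long, url As String

    ' Find the last used row in the URL column instead of hard-coding a range
    lastRow = Cells(Rows.Count, URL_COLUMN).End(xlUp).Row

    For r = 1 To lastRow
        url = Trim(CStr(Cells(r, URL_COLUMN).Value))
        If LCase(url) Like "http://*" Or LCase(url) Like "https://*" Then
            ' GetTitle is the function from the answer above;
            ' the result goes in the column to the right of the URL
            Cells(r, URL_COLUMN + 1).Value = GetTitle(url)
        End If
    Next r
End Sub
```

Changing URL_COLUMN moves both the input column and, one column to its right, the output column.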

Related

how to pull elements from a list in vba

I have extracted something like this from a database:
[{"identifier":{"strategyType":"element1"},"elnSchedules":[{"paymentDate":["element2","element2"]},{"paymentDate":["element2","element2"]}],"composition":{"components":[{"instrument":{"exerciseType":["element3","element3"]}},{"instrument":{"exerciseType":["element3","element3"]}}]},"links":[]}]
I want to write VBA code that creates a table in Excel with the headers strategyType, paymentDate, exerciseType
and the elements element1, element2, element3 under the corresponding headers (each element should appear only once).
So far I have:
For i = 1 To jsonO.Count
    'set headings
    If i = 1 Then
        j = 1
        For Each StrKey In jsonO(i).Keys()
            activeWS.Cells(i + offset, j) = StrKey
            j = j + 1
        Next
    End If
    j = 1
    For Each StrKey In jsonO(i).Keys()
        If (StrKey <> "links") Then
            activeWS.Cells(i + offset + 1, j) = jsonO(i)(StrKey)
            j = j + 1
        End If
    Next
Next i
But this only extracts identifier, elnSchedules, and composition; it cannot reach the specific elements inside them.
Is there any way to do it?
Thanks.
-------------------NEW QUESTION----------------------
I used the code posted in the answer and tried to build a new function alongside it (both are called by the main function):
Public Sub GetEndDate()
Dim activeWS As Worksheet
Set activeWS = ThisWorkbook.Worksheets("Data")
Dim jsonStr As String, Json As Object, headers()
'headers = Array("strategyType", "paymentDate", "exerciseType")
jsonStr = [{"optionFeatures":{"Strike Setting":[{"endDate":["2018-10-16"]}]},"links":[]}] '<== read from cell
Set Json = JsonConverter.ParseJson(jsonStr)(1)
activeWS.Cells(1, 13) = Json("optionFeatures")("Strike Setting")("endDate")
End Sub
However, it was not able to read from the string. Do I need to set up the library again?
Thanks.
Use JSONConverter.bas to parse the JSON string read in from a cell, as shown below. This assumes you only want one instance of each value.
Note:
After adding JSONConverter.bas to the project, go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
In your JSON structure, [] indicates a collection, with items accessed by index starting from 1, and {} indicates a dictionary, with items accessed by key.
I traverse the tree using the appropriate syntax to retrieve the first occurrence of each element.
Option Explicit
Public Sub GetInfoFromSheet()
    Dim jsonStr As String, Json As Object, headers()
    headers = Array("strategyType", "paymentDate", "exerciseType")
    jsonStr = [A1] '<== read from cell
    ' The outer [] is a collection; take its first item (a dictionary)
    Set Json = JsonConverter.ParseJson(jsonStr)(1)
    With ThisWorkbook.Worksheets("Sheet1")
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1) = Json("identifier")("strategyType")
        .Cells(2, 2) = Json("elnSchedules")(1)("paymentDate")(1)
        .Cells(2, 3) = Json("composition")("components")(1)("instrument")("exerciseType")(1)
    End With
End Sub
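For the follow-up question, the same traversal rules apply. The sketch below assumes JsonConverter.bas is imported and Microsoft Scripting Runtime is referenced, as described above. It shows two things: a JSON literal assigned in code has to be a quoted VBA string with every inner double quote doubled, and both "Strike Setting" and "endDate" are [] collections, so each needs an index.

```vba
Public Sub GetEndDate()
    ' Assumes JsonConverter.bas is in the project and a reference to
    ' Microsoft Scripting Runtime is set.
    Dim jsonStr As String, Json As Object

    ' A JSON literal must be a VBA string; inner double quotes are doubled
    jsonStr = "[{""optionFeatures"":{""Strike Setting"":[{""endDate"":[""2018-10-16""]}]},""links"":[]}]"

    ' The outer [] is a collection; take its first item (a dictionary)
    Set Json = JsonConverter.ParseJson(jsonStr)(1)

    ' "Strike Setting" and "endDate" are collections, so index each from 1
    ThisWorkbook.Worksheets("Data").Cells(1, 13) = _
        Json("optionFeatures")("Strike Setting")(1)("endDate")(1)
End Sub
```

If the JSON comes from a cell instead, reading it with jsonStr = [A1] as in the answer avoids the quote doubling altogether.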

OpenOffice Calc Merged Cell properties

I've been searching the web for hours and I can't find an answer for this. I can identify that a given cell is merged but I need to know how many rows are included in the merged cell.
Function aac_MergeRowCount(intStartCol as Integer, intStartRow as Integer)
oSheet = ThisComponent.CurrentController.ActiveSheet
oCell = oSheet.GetCellByPosition(intStartCol, intStartRow)
strData = oCell.GetString()
If oCell.IsMerged Then
strCopy = strData
strWasMerged = True
iCount = oCell.GetNumberOfRows()
End If
aac_MergeRowCount = iCount
End Function
Turns out I just needed to ditch OpenOffice and use Excel. The command I needed is:
strData = oSheet.Cells(r, c).MergeArea.Rows.Count
which did not work in OpenOffice.

Can't arrange items to the expected position

I've written a script in VBA to parse two categories from each container on a webpage. The scraper parses them correctly. The problem I'm facing at the moment is that I can't place these items across columns. If a column contains views, the next column should contain votes, and so on. The result I'm expecting looks more like:
column1 column2 column3 column4
9 views 0 vote 10 views -2
This is my script so far:
Sub CollectInfo()
Const URL As String = "https://stackoverflow.com/questions/tagged/web-scraping"
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLHtmlElement, R&, C&
With Http
.Open "GET", URL, False
.send
Html.body.innerHTML = .responseText
End With
R = 1
For Each post In Html.getElementsByClassName("question-summary")
C = C + 1: Cells(R, C) = post.getElementsByClassName("views")(0).innerText
Cells(R, C + 1) = post.getElementsByClassName("votes")(0).innerText
Next post
End Sub
The way I tried definitely leads to the wrong placement. How can I fix it? By the way, I do not wish to use Offset-based looping (I mean Range("A1").Offset(, 1)); rather, I want to stick to the approach I tried above. Thanks.
This will write views and votes alternately across the columns. I changed XMLHTTP60 to MSXML2.XMLHTTP60 because the unqualified name caused an automation error on my end.
Sub CollectInfo()
    Const URL As String = "https://stackoverflow.com/questions/tagged/web-scraping"
    Dim Http As New MSXML2.XMLHTTP60, Html As New HTMLDocument
    Dim post As HTMLHtmlElement, R&, C&
    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With
    R = 1
    For Each post In Html.getElementsByClassName("question-summary")
        ' Advance the column before each write so views and votes alternate
        C = C + 1
        Cells(R, C) = post.getElementsByClassName("views")(0).innerText
        C = C + 1
        Cells(R, C) = post.getElementsByClassName("votes")(0).innerText
    Next post
End Sub

VBA/DOM - Get elements based on attribute

Excel 2013 on Windows 7. XPath/Javascript/jQuery is out of scope.
I am trying to iterate over select div elements in a page, namely elements that have a specific data-level attribute.
My current approach is similar to this, but I was unable to find a non-manual way to select elements based on attributes. The closest I came was something like:
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
    .Open "GET", url, False
    .Send
    pHTML.body.innerHTML = .ResponseText
End With
' Take the first "chapter" element, then all div elements inside it
Set eCollection = pHTML.getElementsByClassName("chapter")(0).getElementsByTagName("div")
For i = 0 To eCollection.Length - 1
    If eCollection(i).getAttribute("data-level") >= 0 Then
        ' Throw cake
    End If
Next i
This solution, while I am sure it is viable (if inelegant), seems sub-optimal, if only because of how big the loop is going to get when I start looking for specific elements and sequences of elements within these elements.
So I am looking for a way to do something like this:
For Each pElement In pHTML.getElementsByClassName("chapter").getElementsByTagName("div").getElementsByAttribute("data-level")
' Throw cake at the element
Next
I'm aware that there is no method getElementsByAttribute, hence the question.
Is there some approach here that I am blind to, or am I locked to manual iteration?
Alternatively, if I swap my current approach for creating an IE instance, à la this answer, could I conceivably use querySelectorAll to end up with something resembling the result I have outlined above?
For anyone else coming this way, the outer shell, so to speak, can look like this:
Sub ScrapeWithHTMLObj(url As String, domClassName As String, domTag As String, domAttribute As String, domAttributeValue As String)
    ' Dependencies:
    ' * Microsoft HTML Object Library
    ' Declare vars
    Dim pHTML As HTMLDocument
    Dim pElements As Object, pElement As Object
    Dim pEleArray() As Object, e As Variant
    Set pHTML = New HTMLDocument
    ' Basic URL healthcheck
    Do While (url = "" Or (Left(url, 7) <> "http://" And Left(url, 8) <> "https://"))
        MsgBox ("Invalid URL!")
        url = InputBox("Enter new URL: (0 to terminate)")
        If url = "0" Then Exit Sub
    Loop
    ' Fetch page at URL
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", url, False
        .Send
        pHTML.body.innerHTML = .ResponseText
    End With
    ' Declare page elements
    Set pElements = pHTML.getElementsByClassName(domClassName)
    Set pElement = pElements(0).getElementsByTagName(domTag)
    ' Extract only elements with wanted attribute
    pEleArray = getElementsByAttribute(pElement, domAttribute, domAttributeValue)
    For Each e In pEleArray
        ' Do stuff to elements
        Debug.Print e.getAttribute(domAttribute)
    Next
End Sub
If you go this route, you'll also need something like this:
Function getElementsByAttribute(pObj As Object, domAttribute As String, domAttributeValue As String) As Object()
    Dim oTemp() As Object
    Dim i As Long
    ReDim oTemp(1 To 1)
    For i = 0 To pObj.Length - 1
        'Debug.Print pObj(i).getAttribute(domAttribute)
        If pObj(i).getAttribute(domAttribute) = domAttributeValue Then
            ' Store the match, then grow the array by one empty slot
            Set oTemp(UBound(oTemp)) = pObj(i)
            ReDim Preserve oTemp(1 To UBound(oTemp) + 1)
        End If
    Next i
    ' Drop the unused trailing slot
    ReDim Preserve oTemp(1 To UBound(oTemp) - 1)
    getElementsByAttribute = oTemp
End Function
Depending on the HTML tree, you'll need to change which elements you zero in on in the sub, obviously. For the site I used in testing, this structure worked flawlessly.
Example usage:
Call ScrapeWithHTMLObj("https://somesite", "chapter-index", "div", "data-level", "1")
It will enter the first element with the class chapter-index, select all elements with the div tag inside it, and finally extract all elements whose data-level attribute has the value 1.
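On the querySelectorAll part of the question: a CSS attribute selector can do the filtering in a single call, without an IE instance. The sketch below is a minimal illustration, assuming a reference to Microsoft HTML Object Library and an installed MSHTML version that exposes querySelectorAll on HTMLDocument. The markup and the name PrintDataLevels are hypothetical stand-ins, so the example runs without a network request.

```vba
Sub PrintDataLevels()
    ' Requires a reference to Microsoft HTML Object Library
    Dim pHTML As New HTMLDocument
    Dim nodes As Object, i As Long

    ' Hypothetical stand-in markup instead of a downloaded page
    pHTML.body.innerHTML = "<div class=""chapter"">" & _
        "<div data-level=""1"">a</div><div>b</div><div data-level=""2"">c</div></div>"

    ' Every div that has a data-level attribute and sits inside a .chapter element
    Set nodes = pHTML.querySelectorAll(".chapter div[data-level]")

    ' Walk the result by index using .Length and .item
    For i = 0 To nodes.Length - 1
        Debug.Print nodes.item(i).getAttribute("data-level")
    Next i
End Sub
```

To match one specific value rather than the mere presence of the attribute, the selector can be narrowed to div[data-level='1'].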

How to identify string in htm.getelementbyid("mystring") using vba?

I am trying to get data from a different website using the VBA code below, but I don't know how to identify the string inside the parentheses in the statement With htm.getelementbyid("comps-results"). How do I get the string in the parentheses for, for example, this website?
I would appreciate very much if someone could help me on this matter.
Thank you in advance.
Sub GetData()
    Dim x As Long, y As Long
    Dim htm As Object
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://www.zillow.com/homes/comps/67083361_zpid/", False
        .send
        htm.body.innerhtml = .responsetext
    End With
    With htm.getelementbyid("comps-results")
        For x = 0 To .Rows.Length - 1
            For y = 0 To .Rows(x).Cells.Length - 1
                Sheets(1).Cells(x + 1, y + 1).Value = .Rows(x).Cells(y).innertext
            Next y
        Next x
    End With
End Sub
The getElementByID method takes a unique ID as an argument and returns a single HTML element if there is one with such an ID value.
Probably what you need to do is use the getElementsByTagName method, which returns a collection of matching elements. Since this may result in multiple matches, I find it best to create an object first, and an iterator variable:
Dim compresults
Dim el
Set compresults = htm.getelementsbytagname("comps-results")
For Each el In compresults
    MsgBox el.InnerText
Next
BTW, I am fairly certain (but have not verified) that an HTMLElementCollection does not have a .Rows member, so the next line in your code will probably raise an error. Likewise, .Rows does not have a .Length property, so there are at least two errors on that single line of code. In the line after it, note that .Cells does not have a .Length member either, which is another error.
For assistance with those parts of your code, I urge you to ask a new question. This answer addresses your original question.
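The string passed to getElementById is the value of an element's id attribute in the page source (for example, a tag written as <table id="comps-results">), which you can find with the browser's View Source or Inspect tools. Here is a minimal sketch, using hypothetical stand-in markup and a made-up id rather than the real page, that checks whether an id exists before using it:

```vba
Sub FindElementById()
    Dim htm As Object, el As Object
    Set htm = CreateObject("htmlfile")

    ' Hypothetical markup standing in for the downloaded page
    htm.body.innerHTML = "<table id=""demo-table""><tr><td>a</td><td>b</td></tr></table>"

    ' Look the element up by the value of its id attribute
    Set el = htm.getElementById("demo-table")

    ' Guard against a missing id before touching the element
    If el Is Nothing Then
        Debug.Print "No element with that id"
    Else
        Debug.Print el.tagName
        Debug.Print el.innerText
    End If
End Sub
```

If the guard reports that nothing was found for an id you can see in the browser, the element may be added by script after the page loads, in which case it will not be present in the response text at all.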