How to get all the links leading to the next page? - vba

I've written some code in VBA to get all the links leading to the next page from a webpage. The highest page number among those links is 255. When I run my script, I end up with 6906 links, which means the loop collects the same pager links again and again, so the results are full of duplicates. After filtering out the duplicates, 254 unique links remain. My objective is to avoid hardcoding the highest page number for the iteration. Here is what I'm trying:
Sub YifyLink()
Const link = "https://www.yify-torrent.org/search/1080p/"
Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
Dim x As Long, y As Long, item_link as String
With http
.Open "GET", link, False
.send
html.body.innerHTML = .responseText
End With
For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
If InStr(post.innerText, "Last") Then
x = Split(Split(post.href, "-")(1), "/")(0)
End If
Next post
For y = 0 To x
item_link = link & "t-" & y & "/"
With http
.Open "GET", item_link, False
.send
htm.body.innerHTML = .responseText
End With
For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
I = I + 1: Cells(I, 1) = posts.href
Next posts
Next y
End Sub
Elements within which the links are:
<div class="pager">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Next Last </div>
The results I'm getting (partial portion):
about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/

The idea is to scrape pages in a loop and check some condition on each page; when the condition fails, exit the loop.
The check might be, for example, looking up a key in a dictionary, testing whether an element exists, or any other logic specific to your problem.
In your case, the site keeps displaying page 255 for every later page number, which gives us a clue: we can compare an element from page (n) with the corresponding element from page (n-1).
For instance, if an element on page 256 is the same as the one on page 255, then exit the loop/sub. Please see the sample code below:
Sub yify()
Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object, posts As Object
Dim pageno As Long, rowno As Long
pageno = 1
rowno = 1
Do
With http
.Open "GET", mlink & pageno & "/", False
.send
html.body.innerHTML = .responseText
End With
Set posts = html.getElementsByClassName("mv")
If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do
For Each post In posts
With post.getElementsByTagName("div")
If .Length Then
rowno = rowno + 1
Cells(rowno, 1) = .Item(0).innerText
End If
End With
Next post
Debug.Print "pageno: " & pageno & " completed."
pageno = pageno + 1
Loop
End Sub
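The dictionary check mentioned above could look roughly like the sketch below. It is not part of the original answer: IsNewKey, DemoStopOnRepeat and the sample titles are made-up names and data, and it assumes the Scripting.Dictionary object is available (late-bound here, so no extra reference is needed).

```vba
' Returns True the first time a key is seen and False on every repeat.
Function IsNewKey(ByVal seen As Object, ByVal key As String) As Boolean
    If seen.Exists(key) Then
        IsNewKey = False
    Else
        seen.Add key, True
        IsNewKey = True
    End If
End Function

Sub DemoStopOnRepeat()
    Dim seen As Object, pageKeys As Variant, i As Long
    Set seen = CreateObject("Scripting.Dictionary")
    ' Hypothetical "first item on the page" titles; the last page repeats.
    pageKeys = Array("Movie A", "Movie B", "Movie C", "Movie C")
    For i = LBound(pageKeys) To UBound(pageKeys)
        ' A repeated key means the site served the same page again: stop.
        If Not IsNewKey(seen, CStr(pageKeys(i))) Then Exit For
        Debug.Print "page " & (i + 1) & " scraped"
    Next i
End Sub
```

In a real scraper the key would be some text taken from each fetched page, so the loop ends as soon as the site starts repeating itself.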

Related

Can't arrange items to the expected position

I've written a script in VBA to parse two categories from each container on a webpage. The scraper parses them correctly. The problem I'm facing is that I can't place these items across columns. If one column contains views, the next column should contain votes, and so on. The result I'm expecting looks like this:
column1 column2 column3 column4
9 views 0 vote 10 views -2
This is my script so far:
Sub CollectInfo()
Const URL As String = "https://stackoverflow.com/questions/tagged/web-scraping"
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLHtmlElement, R&, C&
With Http
.Open "GET", URL, False
.send
Html.body.innerHTML = .responseText
End With
R = 1
For Each post In Html.getElementsByClassName("question-summary")
C = C + 1: Cells(R, C) = post.getElementsByClassName("views")(0).innerText
Cells(R, C + 1) = post.getElementsByClassName("votes")(0).innerText
Next post
End Sub
The way I tried definitely puts the values in the wrong places. How can I fix it? By the way, I don't want to loop with Offset (I mean Range("A1").Offset(, 1)); rather, I want to stick to the approach I tried above. Thanks.
This will write views and votes in alternating columns. I changed XMLHTTP60 to MSXML2.XMLHTTP60 because the unqualified name caused an automation error on my end.
Sub CollectInfo()
Const URL As String = "https://stackoverflow.com/questions/tagged/web-scraping"
Dim Http As New MSXML2.XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLHtmlElement, R&, C&
With Http
.Open "GET", URL, False
.send
Html.body.innerHTML = .responseText
End With
R = 1
For Each post In Html.getElementsByClassName("question-summary")
C = C + 1
Cells(R, C) = post.getElementsByClassName("views")(0).innerText
C = C + 1
Cells(R, C) = post.getElementsByClassName("votes")(0).innerText
Next post
End Sub
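If you would rather compute the column from the position of the post than keep a running counter, the mapping is simple arithmetic. A minimal sketch, assuming a zero-based post index; ViewsColumn, VotesColumn and DemoColumns are hypothetical helper names, not taken from the code above:

```vba
' Column that holds the views of the post at zero-based position postIndex.
Function ViewsColumn(ByVal postIndex As Long) As Long
    ViewsColumn = 2 * postIndex + 1
End Function

' Column that holds the votes of the same post (right next to its views).
Function VotesColumn(ByVal postIndex As Long) As Long
    VotesColumn = ViewsColumn(postIndex) + 1
End Function

Sub DemoColumns()
    Dim i As Long
    For i = 0 To 2
        Debug.Print "post " & i & ": views -> column " & ViewsColumn(i) & _
                    ", votes -> column " & VotesColumn(i)
    Next i
End Sub
```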

Got stuck scraping certain fields from a site

I've written a script in VBA that can parse "Company Name", "Phone", "Fax" and "Email" from a specific site, but I got stuck scraping "Address", "Web" and "Name". The script uses responseText and the Split method. I hope someone can show me a workaround.
Here is what I tried:
str = Split(http.responseText, " class=""contact-details block dark"">")
y = UBound(str)
For i = 1 To y
Cells(x, 1) = Split(Split(str(i), "Company Name:")(1), "<")(0)
Cells(x, 2) = Split(Split(str(i), "Phone:")(1), "<")(0)
Cells(x, 3) = Split(Split(str(i), "Fax:")(1), "<")(0)
Cells(x, 4) = Split(Split(str(i), "mailto:")(1), ">")(0)
x = x + 1
Next i
Here is the relevant HTML:
<div class="contact-details block dark">
<h3>Contact Details</h3><p>Company Name: PPEHeads Australia<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Web: <a target="_blank" href="http://www.ppeheads.com.au">http://www.ppeheads.com.au</a></p><h4>Address</h4><p>Unit 2 / 4 Reaghs Farm Road<br>MINTO<br>NSW<br>2566</p><h4>Contact</h4><p>Name: Alan Hadfield<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Email: alan#ppeheads.com.au</p>
</div>
Please provide the rest of your code next time, because the problem might not be where you think it is. Luckily, I found your previous post.
If you take a closer look, there are three p tags within your HTML element:
The 1st one holds the company contact details, which you can get with
Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
The 2nd one holds the address details, which you can get with
Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
The 3rd one holds the contact person details, which you can get with
Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
Notice the index (0), (1), (2) at the end of each line: it selects the p tag by its order of appearance.
I amended your previous code and commented the changes so you can see the difference:
Sub RestData()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim ele As Object, ele2 As Object, ele3 As Object, post As Object 'declare each one explicitly; "Dim ele, ele2" would make them Variants
Dim TypeDetails() As String
Dim TypeDetails3() As String 'declare
Dim TypeDetail() As String
Dim i As Long, r As Long
With CreateObject("MSXML2.serverXMLHTTP")
.Open "GET", "http://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG0120000508&folderid=1736", False
.send
html.body.innerHTML = .responseText
End With
'get all the p elements
Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
r = 2
'split from line feed
TypeDetails() = Split(ele.innerText, Chr(10))
TypeDetails3() = Split(ele3.innerText, Chr(10))
'This part handles the company contact details; note the delimiter is ": ", not ":"
For i = 0 To UBound(TypeDetails())
TypeDetail() = Split(TypeDetails(i), ": ")
Cells(r, 1) = VBA.Trim(TypeDetail(0))
Cells(r, 2) = VBA.Trim(TypeDetail(1))
r = r + 1
Next i
'This part goes for Address Details, replaced new line with " " for it to be in the same line
Cells(r, 1) = "Address"
Cells(r, 2) = Replace(ele2.innerText, vbLf, " ")
r = r + 1
'This part goes for Contact Person Details
For i = 0 To UBound(TypeDetails3())
TypeDetail() = Split(TypeDetails3(i), ": ")
Cells(r, 1) = VBA.Trim(TypeDetail(0))
Cells(r, 2) = VBA.Trim(TypeDetail(1))
r = r + 1
Next i
Set html = Nothing: Set ele = Nothing: Set ele2 = Nothing: Set ele3 = Nothing
End Sub
I hope this helps
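One caveat with Split(line, ": ") is that TypeDetail(1) raises a "subscript out of range" error on any line that lacks the delimiter. A guarded variant could look like the sketch below; TrySplitLabel and its parameter names are hypothetical and not part of the code above.

```vba
' Splits "Label: value" at the first ": ".
' Returns False (and leaves the outputs untouched) when no delimiter is found.
Function TrySplitLabel(ByVal lineText As String, _
                       ByRef labelText As String, _
                       ByRef valueText As String) As Boolean
    Dim pos As Long
    pos = InStr(1, lineText, ": ")
    If pos = 0 Then Exit Function
    labelText = Trim$(Left$(lineText, pos - 1))
    valueText = Trim$(Mid$(lineText, pos + 2))
    TrySplitLabel = True
End Function

Sub DemoSplitLabel()
    Dim lbl As String, val As String
    If TrySplitLabel("Web: http://www.ppeheads.com.au", lbl, val) Then
        Debug.Print lbl & " | " & val
    End If
End Sub
```

Because only the first ": " is used, a value that itself contains a colon (such as a URL) stays in one piece.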

VBA/DOM - Get elements based on attribute

Excel 2013 on Windows 7. XPath/Javascript/jQuery is out of scope.
I am trying to iterate over select div elements in a page, namely elements that have a specific data-level attribute.
My current approach is similar to this, but I was unable to find a non-manual way to select elements based on attributes. The closest I came was something like:
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", url, False
.Send
pHTML.body.innerHTML = .ResponseText
End With
Set eCollection = pHTML.getElementsByClassName("chapter")(0).getElementsByTagName("div")
For i = 0 To eCollection.Length - 1
If eCollection(i).getAttribute("data-level") >= 0 Then
' Throw cake
End If
Next i
This solution, while I am sure it is viable (if inelegant), seems sub-optimal, if only because of how big the loop will become once I start looking for specific elements and sequences of elements within these elements.
So I am looking for a way to do something like this:
For Each pElement In pHTML.getElementsByClassName("chapter").getElementsByTagName("div").getElementsByAttribute("data-level")
' Throw cake at the element
Next
I'm aware that there is no method getElementsByAttribute, hence the question.
Is there some approach here that I am blind to, or am I locked to manual iteration?
Alternatively, if I swap my current approach for creating an IE instance, à la this answer, could I conceivably use querySelectorAll to end up with something resembling the result I have outlined above?
For anyone else coming this way, the outer shell, so to speak, can look like this:
Sub ScrapeWithHTMLObj(url As String, domClassName As String, domTag As String, domAttribute As String, domAttributeValue As String)
' Dependencies:
' * Microsoft HTML Object Library
' Declare vars
Dim pHTML As HTMLDocument
Dim pElements As Object, pElement As Object
Set pHTML = New HTMLDocument
' Basic URL healthcheck
Do While (url = "" Or (Left(url, 7) <> "http://" And Left(url, 8) <> "https://"))
MsgBox ("Invalid URL!")
url = InputBox("Enter new URL: (0 to terminate)")
If url = "0" Then Exit Sub
Loop
' Fetch page at URL
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", url, False
.Send
pHTML.body.innerHTML = .ResponseText
End With
' Declare page elements
Set pElements = pHTML.getElementsByClassName(domClassName)
Set pElement = pElements(0).getElementsByTagName(domTag)
' Extract only elements with wanted attribute
pEleArray = getElementsByAttribute(pElement, domAttribute, domAttributeValue)
For Each e In pEleArray
' Do stuff to elements
Debug.Print e.getAttribute(domAttribute)
Next
End Sub
If you go this route, you'll also need something like this:
Function getElementsByAttribute(pObj As Object, domAttribute As String, domAttributeValue As String) As Object()
Dim oTemp() As Object
ReDim oTemp(1 To 1)
For i = 0 To pObj.Length - 1
'Debug.Print pObj(i).getAttribute(domAttribute)
If pObj(i).getAttribute(domAttribute) = domAttributeValue Then
Set oTemp(UBound(oTemp)) = pObj(i)
ReDim Preserve oTemp(1 To UBound(oTemp) + 1)
End If
Next i
ReDim Preserve oTemp(1 To UBound(oTemp) - 1)
getElementsByAttribute = oTemp
End Function
Depending on the HTML tree, you'll need to change which elements you zero in on in the sub, obviously. For the site I used in testing, this structure worked flawlessly.
Example usage:
Call ScrapeWithHTMLObj("https://somesite", "chapter-index", "div", "data-level", "1")
It will take the first element with the class chapter-index, select all of its div descendants, and finally extract those whose data-level attribute has the value 1.
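On the querySelectorAll idea raised in the question: the HTMLDocument object from the Microsoft HTML Object Library exposes querySelectorAll, so a CSS attribute selector may replace the manual filter without an IE instance. The sketch below is only an illustration under that assumption; CountByAttribute, DemoSelector and the sample markup are made up, and whether the selector is honoured can depend on the installed MSHTML version.

```vba
' Requires a reference to the Microsoft HTML Object Library.
' Returns how many elements in the markup match the CSS selector.
Function CountByAttribute(ByVal markup As String, ByVal cssSelector As String) As Long
    Dim doc As New HTMLDocument
    Dim matches As Object
    doc.body.innerHTML = markup
    Set matches = doc.querySelectorAll(cssSelector)
    CountByAttribute = matches.Length
End Function

Sub DemoSelector()
    Dim doc As New HTMLDocument
    Dim matches As Object, i As Long
    ' Hypothetical markup standing in for a fetched page.
    doc.body.innerHTML = "<div class=""chapter""><div data-level=""1"">a</div>" & _
                         "<div data-level=""2"">b</div><div>c</div></div>"
    Set matches = doc.querySelectorAll(".chapter div[data-level='1']")
    ' Index-based access is used here rather than For Each.
    For i = 0 To matches.Length - 1
        Debug.Print matches.Item(i).innerText
    Next i
End Sub
```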

Pull data from Website into VBA

This might fall under dumb questions from a newbie, but I honestly don't know where to start in VBA. I tried a few different approaches from the web to pull data from the site below, and all of them failed miserably. Can someone help me (more or less show me) how to pull the data from this website?
https://rotogrinders.com/projected-stats/nfl?site=fanduel
It wouldn't even let me do Data -> Import. Here is what I have so far. I keep getting stuck on the line For t = 0 To (Table.Length - 1).
Sub test1()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
Dim Table As Object
Dim t As Integer
Dim r As Integer
Dim c As Integer
With appIE
.Navigate "https://rotogrinders.com/projected-stats/nfl?site=fanduel"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set Table = appIE.document.getElementById("proj-stats")
For t = 0 To (Table.Length - 1)
For r = 0 To (Table(t).Rows.Length - 1)
For c = 0 To (Table(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = Table(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
appIE.Quit
Set appIE = Nothing
End Sub
You are close, and there are several ways to get the data. I chose to extract all cell elements (HTML <TD>) and step through them in a simple loop. Since there are six columns, I'm using two variables (r and c, for row and column) to offset each value into the right cell.
Set Table = appIE.document.getElementsbytagname("td")
r = 0
c = 0
For Each itm In Table
Worksheets(1).Range("A1").Offset(r, c).Value = itm.innertext
c = c + 1
If c Mod 6 = 0 Then
r = r + 1
c = 0
End If
Next itm
One last note: sometimes the browser hadn't finished loading before the script moved on. I cheated by setting a breakpoint before the loop, waiting until the page loaded, then pressing F5 to continue execution so that it would always run.
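Instead of a breakpoint, a common approach is to poll the browser's Busy and readyState properties until the document reports complete (readyState 4). A minimal sketch, assuming the same InternetExplorer.Application object as above; IsPageReady, WaitForPage and the timeout are hypothetical additions, and content that scripts fill in after the initial load may still need extra waiting.

```vba
' True once the browser reports it is idle and the document is complete.
' 4 is the value of READYSTATE_COMPLETE.
Function IsPageReady(ByVal isBusy As Boolean, ByVal readyState As Long) As Boolean
    IsPageReady = (Not isBusy) And (readyState = 4)
End Function

' Waits until the page is ready or timeoutSeconds have passed.
' Returns False on timeout so the caller can decide what to do.
Function WaitForPage(ByVal ie As Object, ByVal timeoutSeconds As Long) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)
    Do Until IsPageReady(ie.Busy, ie.readyState)
        If Now > deadline Then Exit Function
        DoEvents
    Loop
    WaitForPage = True
End Function
```

A call such as If Not WaitForPage(appIE, 30) Then Exit Sub would then replace the Do While appIE.Busy loop.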

Generate Webpage Titles from List of URLs in Excel

I have a column in Excel of website URLs and would like to automatically retrieve the webpage titles for each of those URLs and place them in a second corresponding column.
I don't have any experience with VBA, but would like to use VBA to do this. Can someone please list the steps one-by-one to get this done? What code do I paste and where? How do I point the code to the right column to pick up the URLs and how do I tell it where to populate the results? How do I run the code?
I think this is a popular problem, but many don't use the existing documentation on how to address it because they have the same issue as me: they don't know how to load and run the script.
Any help would be greatly appreciated! Please be as detailed as possible.
Adjust the range "A1:A10" to match your data....
Sub GetTitles()
Dim c As Range, url As String
For Each c In Range("A1:A10").Cells
url = Trim(c.Value)
If LCase(url) Like "http://*" Or LCase(url) Like "https://*" Then
c.Offset(0, 1).Value = GetTitle(url)
End If
Next c
End Sub
Function GetTitle(sURL As String)
Dim title As String, res As String, pos1, pos2
Dim objHttp As Object
Set objHttp = CreateObject("MSXML2.ServerXMLHTTP")
objHttp.Open "GET", sURL, False
objHttp.Send ""
res = objHttp.ResponseText
pos1 = InStr(1, UCase(res), "<TITLE>")
pos2 = InStr(1, UCase(res), "</TITLE>")
title = "<not found>"
If pos1 > 0 And pos2 > 0 Then
pos1 = pos1 + Len("<TITLE>")
title = Mid(res, pos1, pos2 - pos1)
End If
GetTitle = title
End Function
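The search for the literal "<TITLE>" misses pages whose title tag carries attributes. A slightly more tolerant extraction is sketched below; ExtractTitle is a hypothetical helper, not part of the answer, and it is still plain string matching rather than real HTML parsing.

```vba
' Returns the text between the first <title ...> tag and its closing tag,
' or "<not found>" when either tag is missing. Matching is case-insensitive.
Function ExtractTitle(ByVal html As String) As String
    Dim openPos As Long, startPos As Long, closePos As Long
    ExtractTitle = "<not found>"
    openPos = InStr(1, html, "<title", vbTextCompare)
    If openPos = 0 Then Exit Function
    ' Skip to the end of the opening tag, past any attributes.
    startPos = InStr(openPos, html, ">")
    If startPos = 0 Then Exit Function
    closePos = InStr(startPos, html, "</title", vbTextCompare)
    If closePos = 0 Then Exit Function
    ExtractTitle = Trim$(Mid$(html, startPos + 1, closePos - startPos - 1))
End Function
```

Inside GetTitle, the block from pos1 = InStr(...) down to End If could then be replaced with title = ExtractTitle(res).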