How to shake off duplicate links while parsing web-data? - vba

I've written some script in vba to parse the links leading to the next page from a torrent site. My script is able to scrape them. However, the issue I'm facing is that couple of duplicate links coming along in the result. My question is whether there is any technique with which I can parse only the unique links?
Sub TorrentData()
Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object
With http
.Open "GET", "https://yts.ag/browse-movies", False
.send
html.body.innerHTML = .responseText
End With
For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
If InStr(post, "page") > 0 Then
x = x + 1: Cells(x, 1) = post.href
End If
Next post
End Sub
Partial picture of the scraped links:
Be sure to check the link before proceeding:
"https://www.dropbox.com/s/647x3m65u90a1bu/Description1.txt?dl=0"

I couldn't make the site work. Anyways, the proper way to use dictionary to eliminate duplicates and write to cells inside the same loop should look something like this:
For Each Post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
If InStr(Post.href, "page") > 0 Then
If Not dict.Exists(Post.href) Then
dict.Add Post.href, "whatever information you would like to store"
x = x + 1
Cells(x, 1) = Post.href
End If
End If
Next Post

Related

Safe way to web scrape within VBA

I need to scrape some price data from a website. To get this done, I use the following snippet:
With http
.Open "GET", url, False
.send
html.body.innerHTML = .responseText
End With
Set topics = html.getElementsByClassName("sidebar-item-label")
For i = 1 To topics.Length - 1
str = topics(i).href
It works, but I am wondering how to secure the data, before assigning the html response to my variable str. To avoid malicious code get run on my windows machine, I need to validate, sanatize and escape the response data, before safe it to the variable and do the further stuff like string splitting and save it into my Excel spreadsheet.
Does anybody can help me with that? Do you need more informations?
My code so far:
Dim objXmlHTTP as New XMLHTTP60, html as New HTMLDocument
Dim prices as IHTMLElementCollection
With objXmlHTTP
.open "GET", URL, FALSE
.send
html.body.innerHTML = .responseText
End With
Set prices = html.getElementsByClassName("MyClass")
After getting the data, I am looping through the collection and search for a specific string "price" and assign the data t/value of price to an Excel cell. So and to avoid that excel will execute any bad code, I want to validate, sanatize and escape the data, before saving the html data into Excel cell.

Catch the POST Request Response and the Redirected URL from XMLHTTP request with VBA

I'm trying to catch a Response to a POST Request using XMLHTTP using the code below
Dim XMLPage As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim htmlEle1 As MSHTML.IHTMLElement
Dim htmlEle2 As MSHTML.IHTMLElement
Dim URL As String
Dim elemValue As String
URL = "https://www.informadb.pt/pt/pesquisa/?search=500004064"
XMLPage.Open "GET", URL, False
XMLPage.send
HTMLDoc.body.innerHTML = XMLPage.responseText
For Each htmlEle1 In HTMLDoc.getElementsByTagName("div")
Debug.Print htmlEle1.className
If htmlEle1.className = "styles__SCFileModuleFooter-e6rbca-1 kUUNkj" Then
elemValue = Trim(htmlEle1.innerText)
If InStr(UCase$(elemValue), "CONSTITU") > 0 Then
'Found Value
Exit For
End If
End If
Next htmlEle1
The problem is that I can't find the ClassName "styles__SCFileModuleFooter-e6rbca-1 kUUNkj", because I notice that when I manually insert the value (500004064) in the search box of the URL : https://www.informadb.pt/pt/pesquisa/, the Web Page generates addicinal traffic and turns up at an end point URL : https://www.informadb.pt/pt/pesquisa/empresa/?Duns=453060832, where that className can be found in the Request ResponseText.
My goal is to use the First URL to retrieve the Duns number '453060832', to be able to access the information in the ResponseText of the EndPoint URL. And to catch Duns Number, I need to find a way to get the Endpoint URL, or try to get The POST request response below, and get that value using JSON parser:
{'TotalResults': 1,
'NumberOfPages': 1,
'Results': [{'Duns': '453060832',
'Vat': '500004064',
'Name': 'A PANIFICADORA CENTRAL EBORENSE, S.A.',
'Address': 'BAIRRO DE NOSSA SENHORA DO CARMO,',
'Locality': 'ÉVORA',
'OfficeType': 'HeadOffice',
'FoundIn': None,
'Score': 231.72766,
'PageUrl': '/pt/pesquisa/empresa/?Duns=453060832'}]}
I'm not being able to capture what is really happening using the XMLHTTP Browser request, that seems to be the below steps:
navigate to https://www.informadb.pt/pt/pesquisa/?search=500004064
Webpage generates additional traffic
Amongst that additional traffic is an API POST XHR request which
returns search results as JSON. That request goes to
https://www.informadb.pt/Umbraco/Api/Search/Companies and includes
the 500004064 identifier amongst the arguments within the post body
Based on the API results the browser ends up at the following URI
https://www.informadb.pt/pt/pesquisa/empresa/?Duns=453060832
Can someone help me please, I have to do it using VBA.
Thanks in advance.
A small example how to POST data to your website using VBA, and how to use bare-bones string processing to extract data from the result, as outlined in my comments above.
Function GetVatId(dunsNumber As String) As String
With New MSXML2.XMLHTTP60
.open "POST", "https://www.informadb.pt/Umbraco/Api/Search/Companies", False
.setRequestHeader "Content-Type", "application/json"
.send "{""Page"":0,""PageSize"":5,""SearchTerm"":""" & dunsNumber & """,""Filters"":[{""Key"":""districtFilter"",""Name"":""Distrito"",""Values"":[]},{""Key"":""legalFormFilter"",""Name"":""Forma Jurídica"",""Values"":[]}],""Culture"":""pt""}"
If .status = 200 Then
MsgBox "Response: " & .responseText, vbInformation
GetVatId = Mid(.responseText, InStr(.responseText, """Vat"":""") + 7, 9)
Else
MsgBox "Repsonse status " & .status, vbExclamation
End If
End With
End Function
Usage:
Dim vatId As String
vatId = GetVatId("453060832") ' => "500004064"
For a more robust solution, you should use a JSON parser and -serializer, something like https://github.com/VBA-tools/VBA-JSON.

Trouble scraping the names from a certain website

I've come across such a webpage which seems to me a bit misleading
to scrape. When I go the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1" it takes me to a page with "suchen" option. After clicking "suchen" it opens a new layout within this tab and takes me to a page with lots of names. So, the site address is same again "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the names of that page, as in "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I wrote so far:
Sub WebData()
Dim http As New MSXML2.xmlhttp60
Dim html As New htmldocument, source As Object, item As Object
With http
.Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
.send
html.body.innerHTML = .responseText
End With
Set source = html.getElementsByClassName("ng-binding ng-scope")
For Each item In source
x = x + 1
Cells(x, 1) = item.innerText
Next item
Set html = Nothing: Set source = Nothing
End Sub
The links are incremented like these as per xhr in developer tool but can't figure out what is the number of the last link.
"https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
"https://jobboerse2.arbeitsagentur.d...00&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=24"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=36"

Crawl to final URL from within Excel VBA

I have a list of domain names, and many of them redirect me to the same domain. For instance... foo1.com, foo2.csm and foo3.com all deposit me at foo.com.
I'm trying to deduplicate the list of domains by writing a VBA script to load the final page and extract it's URL.
I've started from this article which retrieve's the page's title (http://www.excelforum.com/excel-programming-vba-macros/355192-can-i-import-raw-html-source-code-into-excel.html), but can't figure out how to modify it to get the final URL (from which I can extract the URL.
Can anyone please point me in the right direction?
Add a reference to "Microsoft XML, v3.0" (or whatever version you have)
Sub tester()
Debug.Print CheckRedirect("adhpn2.com")
End Sub
Function CheckRedirect(URL As String)
If Not UCase(URL) Like "HTTP://*" Then URL = "http://" & URL
With New msxml2.ServerXMLHTTP40
.Open "HEAD", URL, False
.send
CheckRedirect = .getOption(-1)
End With
End Function
Try this, need to look at .LocationURL:
Public Function gsGetFinalURL(rsURL As String) As String
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.navigate rsURL
Do While .Busy And Not .ReadyState = 4
DoEvents
Loop
gsGetFinalURL = .LocationURL
.Quit
End With
Set ie = Nothing
End Function
I haven't tried it on a huge variety of URLs, just the one you gave and a couple of others. If it is an invalid URL it will return what is passed. You can use the code from the original function to check and handle accordingly.

Excel with VBA - XmlHttp to use div

I am using excel with VBA to open a page and extract some information and putting it in my database. After some research, I figured out that opening IE obviously takes more time and it can be achieved using XmlHTTP. I am using the XmlHTTP to open a web page as proposed in my another question. However, while using IE I was able to navigate through div tags. How can I accomplish the same in XmlHTTP?
If I use IE to open the page, I am doing something like below to navigate through multiple div elements.
Set openedpage1 = iedoc1.getElementById("profile-experience").getElementsbyClassName("title")
For Each div In openedpage1
---------
However, with XmlHttp, I am not able to do like below.
For Each div In html.getElementById("profile-experience").getElementsbyClassName("title")
I am getting an error as object doesn't support this property or method.
Take a look at this answer that I had posted for another question as this is close to what you're looking for. In summary, you will:
Create a Microsoft.xmlHTTP object
Use the xmlHTTP object to open your url
Load the response as XML into a DOMDOcument object
From there you can get a set of XMLNodes, select elements, attributes, etc. from the DOMDocument
The XMLHttp object returns the contents of the page as a string in responseText. You will need to parse this string to find the information you need. Regex is an option but it will be quite cumbersome.
This page uses string functions (Mid, InStr) to extract information from the html-text.
It may be possible to create a DOMDocument from the retreived HTML (I believe it is) but I haven't pursued this.
As mentioned with the answers above put the .responseText into an HTMLDocument and then work with that object e.g.
Option Explicit
Public Sub test()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "http://www.someurl.com", False
.send
html.body.innerHTML = .responseText
End With
Dim aNodeList As Object, iItem As Long
Set aNodeList = html.querySelectorAll("#profile-experience.title")
With ActiveSheet
For iItem = 0 To aNodeList.Length - 1
.Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
'.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
Next iItem
End With
End Sub
Note:
I have literally translated your getElementById("profile-experience").getElementsbyClassName("title") into a CSS selector, querySelectorAll("#profile-experience.title"), so assume that you have done that correctly.