Trouble scraping the names from a certain website - vba

I've come across a webpage that seems a bit tricky to scrape. When I go to the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", it takes me to a page with a "suchen" option. Clicking "suchen" opens a new layout within the same tab and shows a page with lots of names, but the site address stays the same: "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the names on that page, such as "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I wrote so far:
Sub WebData()
Dim http As New MSXML2.xmlhttp60
Dim html As New htmldocument, source As Object, item As Object
With http
.Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
.send
html.body.innerHTML = .responseText
End With
Set source = html.getElementsByClassName("ng-binding ng-scope")
For Each item In source
x = x + 1
Cells(x, 1) = item.innerText
Next item
Set html = Nothing: Set source = Nothing
End Sub
According to the XHR requests shown in the developer tools, the links increment like this, but I can't figure out the number of the last link:
"https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
"https://jobboerse2.arbeitsagentur.d...00&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=24"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=36"

Related

Get Data using MSXML2.XMLHTTP

I am trying to get data using MSXML2.XMLHTTP, but it doesn't work. Any ideas?
Sub getdata
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim website As String
Dim price As String
Dim sht As Worksheet
Application.DisplayAlerts = False
Set sht = ActiveSheet
On Error Resume Next
website = "https://shopee.co.id/AFI-EC-Tshirt-Yumia-(LD-90-P-57)-i.10221730.5568491283"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", website, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.DocumentElement.innerHTML = response
price = html.querySelector("div.AJyN7v")(0).innerText
Debug.Print price
Application.StatusBar = ""
On Error GoTo 0
Application.DisplayAlerts = True
End Sub
I have tried many approaches, but it still doesn't work. I hope someone can help me.
Pretty much everything on that page requires JavaScript to load. JavaScript doesn't run when you make an XMLHTTP request to the landing page, so the price never gets retrieved.
The price is retrieved dynamically from an additional API call that returns JSON.
If you examine the URL, you have the following:
https://shopee.co.id/AFI-EC-Tshirt-Yumia-(LD-90-P-57)-i.10221730.5568491283
The last set of consecutive numbers is the product id i.e. 5568491283.
If you open the Network tab of the dev tools (F12), press F5 to refresh so that the web traffic updating the page is recorded, filter to XHR-only traffic, and then enter your product id into the search box, the first result is the XHR that returns the price:
https://shopee.co.id/api/v2/item/get?itemid=5568491283&shopid=10221730
The response is JSON, so you will need a JSON parser to extract the result (or use a regex on the string, which is less preferable).
In the Headers sub-tab you can view info about the XHR request made.
Check the terms and conditions to see whether scraping is allowed and whether there is a public API for retrieving this data.
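As a rough sketch of the approach described above, the API URL can be requested directly and the value pulled out of the JSON text. This version uses the regex fallback rather than a JSON parser, purely to stay dependency-free. The field name "price" and the helper name ExtractNumericField are assumptions for illustration; inspect the actual response in the Network tab to confirm which field holds the value you need.

```vba
' Hypothetical helper: returns the first numeric value of the given
' JSON field, or an empty string if the field is not found.
Function ExtractNumericField(json As String, fieldName As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    ' Matches e.g. "price": 12345 or "price":12345
    re.Pattern = """" & fieldName & """:\s*(\d+)"
    Set matches = re.Execute(json)
    If matches.Count > 0 Then ExtractNumericField = matches(0).SubMatches(0)
End Function

Sub GetPriceFromApi()
    Dim request As Object
    Set request = CreateObject("MSXML2.XMLHTTP")
    ' URL taken from the XHR traffic described above
    request.Open "GET", "https://shopee.co.id/api/v2/item/get?itemid=5568491283&shopid=10221730", False
    request.send
    If request.Status <> 200 Then
        Debug.Print "Request failed with status " & request.Status
        Exit Sub
    End If
    ' Assumption: the response contains a numeric "price" field
    Debug.Print ExtractNumericField(request.responseText, "price")
End Sub
```

A proper JSON parser is still the more robust choice if the response is nested or the same field name appears more than once.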

Import web source code including not displayed on page

I want to import into Excel the web page source code that I see using the View Page Source option in Chrome. But when I import it using the code below, it doesn't import all the content. The values that I'm looking for are not displayed on the web page.
I'm also unable to locate the element using getElementsByClassName or other methods.
Private Sub HTML_VBA_Excel()
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
'Change the URL before executing the code
sURL = "http://pntaconline/getPrDetails?entry=8923060"
'Extract data from website to Excel using VBA
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
sPageHTML = oXMLHTTP.responseText
'Get webpage data into Excel
' If the source code is long, save it to an external text file or somewhere else,
' since an Excel cell has a limit on the number of characters it can store
ThisWorkbook.Sheets(1).Cells(1, 1) = sPageHTML
MsgBox "XMLHTML Fetch Completed"
End Sub
The data I want to import is the IDs and names:
So you need to understand the DOM in order to realize why this isn't loading everything.
XMLHTTP is going to load only the specific resource you requested. A lot of web pages (sorry, pretty much all web pages) load extra resources after the initial request is done.
If you're missing stuff, it's probably loaded by a different network request. So open up DevTools in Chrome, make sure the Network tab is recording, and watch how many network requests go in and out when you load your target page.
Essentially, if you're using XMLHTTP, you'd have to simulate each of those requests to get the data you want to scrape.
EDIT
So you're just kind of pasting the response data into Excel.
Better to create an HTMLDocument variable and then set the response from XMLHTTP as its content. Getting the response works like here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762275(v=vs.85)
Dim xmlhttp As Object
Set xmlhttp = CreateObject("Msxml2.XMLHTTP.3.0")
xmlhttp.Open "GET", "http://localhost/books.xml", False
xmlhttp.send
Debug.Print xmlhttp.responseText
Dim xString As String
xString = xmlhttp.responseText
'search the xString variable
You can then split that response for the sheet or search it and extract the values in VBA memory, rather than print to the sheet.
You could also set the xString response text as the innerHTML of a new HTMLDocument variable:
' Requires a reference to the Microsoft HTML Object Library
Dim xHTML As New HTMLDocument
xHTML.body.innerHTML = xString

How to shake off duplicate links while parsing web-data?

I've written a script in VBA to parse the links leading to the next pages from a torrent site. My script is able to scrape them. However, the issue I'm facing is that a couple of duplicate links come along in the result. My question is whether there is any technique with which I can parse only the unique links.
Sub TorrentData()
Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object
With http
.Open "GET", "https://yts.ag/browse-movies", False
.send
html.body.innerHTML = .responseText
End With
For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
If InStr(post, "page") > 0 Then
x = x + 1: Cells(x, 1) = post.href
End If
Next post
End Sub
Partial picture of the scraped links:
Be sure to check the link before proceeding:
"https://www.dropbox.com/s/647x3m65u90a1bu/Description1.txt?dl=0"
I couldn't make the site work. Anyway, the proper way to use a dictionary to eliminate duplicates and write to cells inside the same loop should look something like this:
For Each Post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
If InStr(Post.href, "page") > 0 Then
If Not dict.Exists(Post.href) Then
dict.Add Post.href, "whatever information you would like to store"
x = x + 1
Cells(x, 1) = Post.href
End If
End If
Next Post
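The snippet above does not show where dict comes from. A minimal self-contained sketch of the same technique follows, assuming a late-bound Scripting.Dictionary; the sample URLs are hypothetical stand-ins for the scraped href values.

```vba
Sub WriteUniqueLinks()
    Dim dict As Object
    Dim links As Variant, link As Variant
    Dim x As Long

    ' Late binding, so no reference to Microsoft Scripting Runtime is needed
    Set dict = CreateObject("Scripting.Dictionary")

    ' Hypothetical sample data standing in for the scraped links
    links = Array("https://example.com/?page=2", _
                  "https://example.com/?page=3", _
                  "https://example.com/?page=2")

    For Each link In links
        ' Only write a link the first time it is seen
        If Not dict.Exists(link) Then
            dict.Add link, True
            x = x + 1
            Cells(x, 1) = link
        End If
    Next link
End Sub
```

With this sample data, only the two distinct links are written to column A of the active sheet. Note that dictionary keys are compared case-sensitively by default.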

How to login to a website using vba?

I am trying to log in to the mentioned website using VBA but am unable to get through. Kindly assist. The website I am trying to log in to contains a form where we are supposed to fill in our credentials. The website opens using the code below, but nothing happens after that.
Dim HTMLDoc As HTMLDocument
Dim MyBrowser As InternetExplorer
Sub MYRED()
Dim MyHTML_Element As IHTMLElement
Dim MyURL as String
On Error GoTo Err_Clear
MyURL = "https://www.Markit.com"
Set MyBrowser = New InternetExplorerMedium
MyBrowser.Silent = True
MyBrowser.navigate MyURL
MyBrowser.Visible = True
Do
Loop Until MyBrowser.readyState = READYSTATE_COMPLETE
Set HTMLDoc = MyBrowser.Document
'Useridele = "input_5ac52c16-cbba-4fb4-b284-857fac1f55fd"
MyBrowser.document.getElementById("input_5ac52c16-cbba-4fb4-b284-857fac1f55fd").Value = "user#example.com"
'HTMLDoc.all.input_5ac52c16-cbba-4fb4-b284-857fac1f55fd.Value = "atul.sanwal#markit.com" 'Enter your email id here
passele = "input_b14b8d03-75b6-49b5-a61a-602672036046"
MyBrowser.document.getElementById("input_b14b8d03-75b6-49b5-a61a-602672036046").Value = "***************"
'HTMLDoc.all.passele.Value = "12345678" 'Enter your password here
For Each MyHTML_Element In HTMLDoc.getElementsByTagName("input")
If MyHTML_Element.Type = "submit" Then MyHTML_Element.Click: Exit For
Next
Err_Clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
End Sub
As I can see, you are using this:
"input_5ac52c16-cbba-4fb4-b284-857fac1f55fd"
I'm guessing that this ID is generated on every request (yep, every time I refresh the site this element gets a new ID), so you cannot pass a static ID to getElementById. You should try other ways to get the specific element you want to fill with data.
Maybe your library supports getting elements by attribute, so you could get the element with the attribute
name='username'
One additional suggestion:
You are using HTML to make the request via the web interface, but you could try to skip the login form and make pure requests instead.
As I see it, when I fill in the data and press Submit, my browser sends a
POST request:
POST /home/UFEExternalSignOnServlet HTTP/1.1
Host: products.markit.com
I don't know how much you know about web APIs etc., but you can write to me if you need more help.
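To illustrate the attribute-based lookup suggested above, here is a minimal sketch that loops over the input elements and matches on the name attribute instead of the generated ID. The helper name FindInputByName is hypothetical, and name='username' is only the guess made above; check the actual form markup for the real attribute values.

```vba
' Hypothetical helper: returns the first <input> whose name attribute
' equals fieldName, or Nothing if no such element exists.
Function FindInputByName(doc As Object, fieldName As String) As Object
    Dim el As Object
    For Each el In doc.getElementsByTagName("input")
        ' & "" turns a missing (Null) attribute into an empty string
        If el.getAttribute("name") & "" = fieldName Then
            Set FindInputByName = el
            Exit Function
        End If
    Next el
End Function
```

In the login routine it could be used like this, with MyBrowser being the Internet Explorer object from the question:

```vba
Dim userBox As Object
Set userBox = FindInputByName(MyBrowser.document, "username")
If Not userBox Is Nothing Then userBox.Value = "user#example.com"
```

Because the lookup does not depend on the ID, it keeps working when the page regenerates the ID on each load, as long as the name attribute stays stable.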

Using MSXML to fetch data from website

I am trying to use the following code to geocode a bunch of cities from this website: mygeoposition.com but there seems to be some sort of issue and the variable 'Lat' in the following code always returns empty:
Sub Code()
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60
IE.Open "GET", "http://mygeoposition.com/?q=Chuo-ku, Osaka", False
IE.send
While IE.ReadyState <> 4
DoEvents
Wend
Dim HTMLDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set htmlBody = HTMLDoc.body
htmlBody.innerHTML = IE.responseText
Lat = HTMLDoc.getElementById("geodata-lat").innerHTML
IE.abort
End Sub
I have another piece of code that uses the browser to do the same thing, and it works fine but gets quite slow. When I use this code with MSXML, it just doesn't work. Apologies, I am a newbie at using VBA for data extraction from websites. Please help.
The response contains no content in the geodata-lat element. It appears that client-side code fetches that data, so your response only contains the HTML that the server generated. I tried this out myself, and this is the section of the response you are looking for. You can see it is empty:
If you try an element that has content (geodata-kml-button), it does pull in a value ("Download KML file"). Ignore the ByteArrayToString() call, that was just me testing:
If they don't have a real API then I don't think you can get your data this way.