I am working on web page data extraction. My code extracts data correctly, but only from the first page. The page loads more data as you scroll down: the second page loads when you scroll down, then the third, the fourth, and so on. The code below works fine, but only for the first page. Is it possible to load all pages first and then start the extraction?
URL: https://www.trendyol.com/kadin-spor-outdoor-x-g1-c104593?sst=MOST_RATED
Code
http.Open "GET", url, False
http.Send
html.body.innerHTML = http.ResponseText
html1 = html.body.innerHTML
Set tdata = html.getElementsByClassName("p-card-chldrn-cntnr")
For Each Item In tdata
href2 = Item.getElementsByTagName("a")(0).href 'href of the first link inside the card
href2 = Replace(href2, "about:", "")
Next Item
Related
I need to scrape some price data from a website. To get this done, I use the following snippet:
With http
.Open "GET", url, False
.send
html.body.innerHTML = .responseText
End With
Set topics = html.getElementsByClassName("sidebar-item-label")
For i = 1 To topics.Length - 1
str = topics(i).href
It works, but I am wondering how to secure the data before assigning the HTML response to my variable str. To avoid malicious code being run on my Windows machine, I need to validate, sanitize, and escape the response data before saving it to the variable and doing further processing such as string splitting and saving it into my Excel spreadsheet.
Can anybody help me with that? Do you need more information?
My code so far:
Dim objXmlHTTP As New XMLHTTP60, html As New HTMLDocument
Dim prices As IHTMLElementCollection
With objXmlHTTP
.open "GET", URL, False
.send
html.body.innerHTML = .responseText
End With
Set prices = html.getElementsByClassName("MyClass")
After getting the data, I loop through the collection, search for the specific string "price", and assign the value of price to an Excel cell. To prevent Excel from executing any malicious code, I want to validate, sanitize, and escape the data before saving the HTML data into an Excel cell.
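One way to approach the "sanitize before writing to a cell" step is sketched below. It is a minimal sketch, not a complete security measure: the helper name SafeCellText is made up for illustration, and it assumes the main risk is Excel evaluating scraped text as a formula when the text starts with =, +, - or @. It removes control characters and prefixes such text with an apostrophe so Excel stores it as a literal string.

```vba
Option Explicit

' Hypothetical helper (not part of any library): returns text that Excel
' will store as a literal string instead of evaluating it as a formula.
Public Function SafeCellText(ByVal rawText As String) As String
    Dim cleaned As String
    Dim ch As String
    Dim code As Long
    Dim i As Long

    ' Drop control characters (code points 0-31 and 127)
    For i = 1 To Len(rawText)
        ch = Mid$(rawText, i, 1)
        code = AscW(ch) And &HFFFF&          ' AscW can be negative; map to 0..65535
        If code >= 32 And code <> 127 Then cleaned = cleaned & ch
    Next i
    cleaned = Trim$(cleaned)

    ' A leading =, +, - or @ would make Excel treat the text as a formula
    If Len(cleaned) > 0 Then
        If InStr(1, "=+-@", Left$(cleaned, 1), vbBinaryCompare) > 0 Then
            cleaned = "'" & cleaned
        End If
    End If

    SafeCellText = cleaned
End Function
```

Usage would look like Cells(1, 1).Value = SafeCellText(prices(0).innerText). Note that a negative price such as "-5" is also stored as text with this approach, so convert values you have validated as numeric separately (for example with IsNumeric and CDbl).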
I am trying to get data using MSXML2.XMLHTTP, but it doesn't work. Any ideas?
Sub getdata()
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim website As String
Dim price As String
Dim sht As Worksheet
Application.DisplayAlerts = False
Set sht = ActiveSheet
On Error Resume Next
website = "https://shopee.co.id/AFI-EC-Tshirt-Yumia-(LD-90-P-57)-i.10221730.5568491283"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", website, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.DocumentElement.innerHTML = response
price = html.querySelector("div.AJyN7v").innerText 'querySelector returns a single element, so no index is needed
Debug.Print price
Application.StatusBar = ""
On Error GoTo 0
Application.DisplayAlerts = True
End Sub
I have tried many approaches, but it still doesn't work.
I hope someone can help me.
Pretty much everything on that page requires JavaScript to load. JavaScript doesn't run when you make an XMLHTTP request to the landing page, so the price never gets retrieved.
The price is retrieved dynamically from an additional API call that returns JSON.
If you examine the URL, you will see the following:
https://shopee.co.id/AFI-EC-Tshirt-Yumia-(LD-90-P-57)-i.10221730.5568491283
The last set of consecutive numbers is the product id, i.e. 5568491283.
If you open the Network tab of dev tools (F12), press F5 to refresh and capture the web traffic that updates the page, filter on XHR-only traffic, and then enter your product id into the search box, the first result retrieved is the XHR that returns the price:
https://shopee.co.id/api/v2/item/get?itemid=5568491283&shopid=10221730
The response is JSON, so you will need a JSON parser to extract the result (or use regex on the string, which is less preferable).
In the Headers sub-tab you can view info about the XHR request made.
Check the terms and conditions to see whether scraping is allowed and also whether there is a public API for retrieving this data.
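The URL pattern described above can be sketched in VBA as follows. This is only an illustration under stated assumptions: the helper name BuildItemApiUrl is hypothetical, the product URL is assumed to end in "-i.<shopid>.<itemid>" with no query string, and the endpoint is the one quoted above, which may have changed since.

```vba
Option Explicit

' Hypothetical helper: turns a product page URL ending in
' "-i.<shopid>.<itemid>" into the item API URL quoted above.
Public Function BuildItemApiUrl(ByVal productUrl As String) As String
    Dim markerPos As Long
    Dim ids() As String

    markerPos = InStrRev(productUrl, "-i.")
    If markerPos = 0 Then Exit Function        ' returns "" when the pattern is absent

    ids = Split(Mid$(productUrl, markerPos + 3), ".")
    If UBound(ids) <> 1 Then Exit Function     ' expect exactly shopid and itemid

    BuildItemApiUrl = "https://shopee.co.id/api/v2/item/get?itemid=" & ids(1) & _
                      "&shopid=" & ids(0)
End Function

Public Sub PrintItemJson()
    Dim request As Object
    Set request = CreateObject("MSXML2.XMLHTTP")
    request.Open "GET", BuildItemApiUrl( _
        "https://shopee.co.id/AFI-EC-Tshirt-Yumia-(LD-90-P-57)-i.10221730.5568491283"), False
    request.send
    Debug.Print request.responseText           ' raw JSON; parse it with a JSON parser of your choice
End Sub
```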
I want to import into Excel the web page source code that I see using the View Page Source option in Chrome. But when I import it using the code below, it doesn't import all the content. The values that I'm looking for are not displayed on the web page.
I'm also unable to locate the element using getElementsByClassName or other methods.
Private Sub HTML_VBA_Excel()
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
'Change the URL before executing the code
sURL = "http://pntaconline/getPrDetails?entry=8923060"
'Extract data from website to Excel using VBA
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
sPageHTML = oXMLHTTP.responseText
'Get webpage data into Excel
' If the source code is long, save it to an external text file or elsewhere instead,
' since an Excel cell has a limit on the number of characters it can store
ThisWorkbook.Sheets(1).Cells(1, 1) = sPageHTML
MsgBox "XMLHTML Fetch Completed"
End Sub
The data I want to import is the IDs and names:
So you need to understand the DOM in order to realize why this isn't loading everything.
XMLHTTP is going to load only the specific resource you requested. A lot of web pages, sorry, pretty much all web pages, load extra resources after the initial request is done.
If you're missing stuff, it's probably loaded by a different network request. So open up DevTools in Chrome, make sure the Network tab is recording, and watch how many network requests go in and out when you load your target page.
Essentially, if you're using XMLHTTP, you'd have to simulate each of those requests to get the data you want to scrape.
EDIT
So you're just kind of pasting the data response into Excel.
Better to create an HTMLDocument variable and then fill it from the XMLHTTP response, like here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762275(v=vs.85)
Dim xmlhttp As Object
Set xmlhttp = CreateObject("Msxml2.XMLHTTP.3.0")
xmlhttp.Open "GET", "http://localhost/books.xml", False
xmlhttp.send
Debug.Print xmlhttp.responseText
Dim xString As String
xString = xmlhttp.responseText
'search the xString variable
You can then split that response for the sheet or search it and extract the values in VBA memory, rather than print to the sheet.
You could also set the xString responseText as the innerHTML for a new HTMLDocument variable:
Dim xHTML As New HTMLDocument 'requires a reference to Microsoft HTML Object Library
xHTML.body.innerHTML = xString
I've come across a webpage that seems a bit tricky to scrape. When I go to the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1" it takes me to a page with a "suchen" option. After clicking "suchen", it opens a new layout within the same tab and takes me to a page with lots of names. The site address, however, stays the same: "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the names of that page, as in "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I wrote so far:
Sub WebData()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument, source As Object, item As Object, x As Long
With http
.Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
.send
html.body.innerHTML = .responseText
End With
Set source = html.getElementsByClassName("ng-binding ng-scope")
For Each item In source
x = x + 1
Cells(x, 1) = item.innerText
Next item
Set html = Nothing: Set source = Nothing
End Sub
The links are incremented like this according to the XHR entries in the developer tools, but I can't figure out what the number of the last link is.
"https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
"https://jobboerse2.arbeitsagentur.d...00&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=24"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=36"
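The offset pattern in these links (12, 24, 36) can be generated in code. The following is a rough sketch only: PagedUrl and FetchAllPages are hypothetical names, baseUrl stands for the full (here truncated) link that already carries a query string, the page size of 12 is inferred from the offsets listed above, and the stop condition is an assumption because the behavior of the server after the last page is unknown.

```vba
Option Explicit

' Hypothetical helper: appends the offset for a zero-based page index.
Public Function PagedUrl(ByVal baseUrl As String, ByVal pageIndex As Long, _
                         Optional ByVal pageSize As Long = 12) As String
    If pageIndex <= 0 Then
        PagedUrl = baseUrl                     ' the first request carries no offset
    Else
        PagedUrl = baseUrl & "&offset=" & CStr(pageIndex * pageSize)
    End If
End Function

Public Sub FetchAllPages(ByVal baseUrl As String)
    Const MAX_PAGES As Long = 50               ' arbitrary safety cap (assumption)
    Dim http As Object
    Dim pageIndex As Long

    Set http = CreateObject("MSXML2.XMLHTTP")
    For pageIndex = 0 To MAX_PAGES - 1
        http.Open "GET", PagedUrl(baseUrl, pageIndex), False
        http.send
        ' Assumed stop condition: an error status or an empty body marks the end
        If http.Status <> 200 Or Len(http.responseText) = 0 Then Exit For
        Debug.Print pageIndex, Len(http.responseText)
    Next pageIndex
End Sub
```

If the response reports a total result count, that would be a more reliable way to find the last offset than probing until a request fails.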
I'm trying to scrape quotes of Moroccan stocks from this website using VBA:
http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225
There, you select a security, check "By period", specify the date interval, and finally click the "Submit" button.
I first went with the easy method, using an Internet Explorer object:
Sub method1()
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
IE.Navigate "http://www.casablanca-bourse.com/bourseweb/Negociation-Historique.aspx?Cat=24&IdLink=302"
Do While IE.Busy
DoEvents
Loop
'Picking the security
Set obj1 = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_DDValeur")
obj1.Value = "4100 " 'Security code taken from the source html
'Specifying "By period"
Set obj2 = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_RBSearchDate")
obj2.Checked = True
'Start date
Set obj3 = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_DateTimeControl1_TBCalendar")
obj3.Value = "07/03/2016"
'End date
Set obj4 = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_DateTimeControl2_TBCalendar")
obj4.Value = "07/03/2016"
'Clicking the button
Set objs = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_Image1")
objs.Click
'Setting the data <div> as an object
Set obj5 = IE.document.getElementById("HistoriqueNegociation1_UpdatePanel1")
s = obj5.innerHTML
'Looping until the quotes pop up
Do While InStr(s, "HistoriqueNegociation1_HistValeur1_RptListHist_ctl01_Label3") = 0
Application.Wait Now + TimeSerial(0, 0, 1) 'DateAdd rounds 0.1 s to 0, so wait one whole second instead
s = obj5.innerHTML
Loop
'Printing the value
Set obj6 = IE.document.getElementById("HistoriqueNegociation1_HistValeur1_RptListHist_ctl01_Label3")
Cells(1, 1).Value = CDbl(obj6.innerText)
IE.Quit
Set IE = Nothing
End Sub
Because this webpage is dynamic, I had to make the application wait until the data appears in the HTML code, which is why I used that second Do While loop.
Now, what I want to do is use the harder way: sending the form request through VBA. That is pretty easy for GET requests, but this site uses a POST request that I found pretty hard to mimic in VBA.
I used this simple code:
Sub method2()
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
URL = "http://www.casablanca-bourse.com/bourseweb/Negociation-Historique.aspx?Cat=24&IdLink=302"
objHTTP.Open "POST", URL, False
objHTTP.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
objHTTP.send ("encoded request params go here")
Cells(1, 1).Value = objHTTP.ResponseText
End Sub
I used the Chrome DevTools (F12) to record the POST request, but I had a hard time figuring out what the params should be. (The form data is too long to screenshot or copy here, so please feel free to record it yourself.) I went with only the params that I needed (security code, the radio button, and the two dates), but the response didn't match the one in DevTools, and it didn't contain any usable data. Here are the params that I used:
HistoriqueNegociation1$HistValeur1$DDValeur=9000%20%20&HistoriqueNegociation1$HistValeur1$historique=RBSearchDate&HistoriqueNegociation1$HistValeur1$DateTimeControl1$TBCalendar=07%2F03%2F2016&HistoriqueNegociation1$HistValeur1$DateTimeControl2$TBCalendar=07%2F03%2F2016
Obviously, I'm not getting something (or everything) right here.
Actually, I can't just pick "some of the params"; I have to send all of them. I didn't do that at first because the params line that I got from DevTools was too long (47012 characters), and Excel VBA doesn't accept a line that long. So I copied the params to a text file and then sent the request using that file, and it worked.
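For reference, reading the params from a text file and sending them can be sketched like this. It is a sketch under assumptions, not the exact code I ran: the file path is a placeholder, ReadAllText is a small helper written for this example, and the file is assumed to contain the already URL-encoded form data on a single line with no trailing line break.

```vba
Option Explicit

' Reads a whole text file into a string.
Public Function ReadAllText(ByVal filePath As String) As String
    Dim fileNo As Integer
    Dim content As String

    fileNo = FreeFile
    Open filePath For Binary Access Read As #fileNo
    content = Space$(LOF(fileNo))              ' size the buffer to the file length
    Get #fileNo, , content
    Close #fileNo

    ReadAllText = content
End Function

Public Sub PostSavedParams()
    Const PARAMS_FILE As String = "C:\temp\params.txt"   ' hypothetical path
    Dim objHTTP As Object

    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    objHTTP.Open "POST", _
        "http://www.casablanca-bourse.com/bourseweb/Negociation-Historique.aspx?Cat=24&IdLink=302", False
    objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    objHTTP.send ReadAllText(PARAMS_FILE)
    Debug.Print Left$(objHTTP.responseText, 500)          ' preview of the response
End Sub
```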