I am writing VBA web-scraping code to grab product names from the web and add them to an Excel worksheet. This code was working fine a minute ago, and then all of a sudden it stopped scraping the information. Any ideas what the problem might be? The website is still up and running, and none of the variables I inspected have changed. Here is my code:
Dim http As New XMLHTTP60, html As New HTMLDocument, x As Long

With http
    .Open "GET", "https://www.notebooksbilliger.de/pc+hardware/grafikkarten+pc+hardware/amdati/rx+6600+amdati/", False
    .send
    html.body.innerHTML = .responseText
End With

Do
    x = x + 1
    On Error Resume Next
    Cells(x + 1, 1) = html.querySelectorAll("div.product_name a")(x - 1).innerText
Loop Until Err.Number = 91
I even restored my last save, which was also working 100%, and now it doesn't work either. I have not added anything to the code or changed any references.
Is it possible that after multiple test runs a webpage blocks scraping for some time?
I found out that the problem with this code and this specific website is that if you send many queries in quick succession, you get locked out for an indeterminate amount of time.
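A minimal sketch of how the requests could be throttled to stay under that limit, assuming the same Microsoft XML, v6.0 reference as above (FetchThrottled is a hypothetical helper, and the five-second pause is a guess to tune, not a value the site documents):

' Fetches a page, pausing so that successive requests are at least 5 seconds apart
Function FetchThrottled(url As String) As String
    Static lastRequest As Date ' persists between calls
    Dim http As New XMLHTTP60
    If lastRequest <> 0 Then
        Do While Now < lastRequest + TimeValue("00:00:05")
            DoEvents
        Loop
    End If
    http.Open "GET", url, False
    http.send
    lastRequest = Now
    FetchThrottled = http.responseText
End Function

The loop above would then use html.body.innerHTML = FetchThrottled("https://www.notebooksbilliger.de/pc+hardware/grafikkarten+pc+hardware/amdati/rx+6600+amdati/") instead of issuing the GET directly.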
I built a very complex SeleniumBasic-via-VBA-via-Excel-add-in setup that interacts with one of the leading ticketing-system websites to scrape and populate data. The system interacts with 90+ different fields/clickables across 3 different pages, 6 different tabs, and nested popups, and is deployed to 120 users who use the automations about 20 times per day.
And it has been working flawlessly for over a year...
We have just provisioned 20 more users on the same system, and their automations refuse to work.
Here is where I am at with my research:
I am able to manually step through the code on the new systems and have it successfully go through the entire automation, so the issue probably has something to do with the speed at which VBA/Selenium is trying to interact with the website.
Once the system is unable to find a field or clickable, it refuses to find any other ones after that.
To go as fast as possible, the system is built around standard VBA error handling: if it fails to find something, it jumps to the error handler, which waits one second and then tries again. Again, this system has been working flawlessly for over a year and is currently working on 120 users' systems.
To see if maybe Selenium was refusing to reload the clickable, I changed the error handling from a Resume to a Resume Next, and then had a Do While loop with a Boolean flag keep trying until it was successful; but after the first failure, it refused to find anything else, including different fields.
The one thing the 20 new users have in common is that they are all using v66 of Chrome, whereas all of the systems that are working have old copies of Chrome that are at least 9 months old.
Thinking this might be the issue, I grabbed all of the ChromeDrivers and systematically went through them one by one to test whether we got different behavior with a different ChromeDriver, but all of the ChromeDrivers produced the same error.
So that is where I stand. I'm wondering if there is some key insight I am missing, or a workaround that will get newer versions of Chrome to retry the fields. Or do I need to try a VBA/Selenium tool other than SeleniumBasic to fix this? Or do I need to roll back these 20 users to older versions of Chrome?
Thanks for sharing your expertise.
''''Check to see if there are any aliases=================================
For AliasCheck = 2 To AliasCounter + 1
    If Hash = Sheets("Temp Subjects & Locations").Range("AY" & AliasCheck) Then
        AliasName = Sheets("Temp Subjects & Locations").Range("AY" & AliasCheck)
        AliasCount = AliasCount + 1
        AliasDisplayName = Sheets("Temp Subjects & Locations").Range("AZ" & AliasCheck)
        temp1 = ""
        temp2 = ""
        Call countryDictionary
        'drops the country back into Excel to later remove the dupes
        temp1 = Sheets("Temp Subjects & Locations").Range("BA" & AliasCheck)
        temp2 = dict(temp1)
        Sheets("Temp Subjects & Locations").Range("BG" & AliasCount + 1) = temp2

        ''''Click to add AKA's names
        iframeText = "iframe_win_" & AddParty
        robot.SwitchToDefaultContent
        robot.SwitchToFrame iframeText
        iframeTracker = iframeTracker + 1
        iframeText = "iframe_win_" & iframeTracker
        robot.SwitchToDefaultContent
        robot.SwitchToFrame iframeText
        robot.FindElementById("X_SUBJECT_ALTERNATE_NM.X_ALTERNATE_NM").SendKeys (AliasDisplayName)
        robot.FindElementById("dijit_form_Button_0").Click
    End If
Next AliasCheck
The AddParty variable is a way to track the number of the pop-up we came from.
The iframeTracker variable is a way to track the number of the pop-up we are going to; the system numbers its pop-ups sequentially instead of giving them legible names...
The newer systems will make it down to the SendKeys and then decide not to work. On a Resume Next, it will then refuse to find the OK button ("dijit_form_Button_0").
Here is the code I was playing around with to see if I could get it to retry using a "Resume Next" instead of a "Resume":
robot.FindElementById("X_SUBJECT_ALTERNATE_NM.X_ALTERNATE_NM").SendKeys (AliasDisplayName)
Do While FailRetry = True
    FailRetry = False
    robot.FindElementById("X_SUBJECT_ALTERNATE_NM.X_ALTERNATE_NM").SendKeys (AliasDisplayName)
Loop

'(the procedure's On Error GoTo errHandler4 is assumed to be set earlier; Resume Next returns control above)
errHandler4:
If errorCounter < 21 Then
    Application.Wait (Now + TimeValue("00:00:01"))
    errorCounter = errorCounter + 1
    FailRetry = True
    Resume Next
Else
    MsgBox "Reached 20 second timeout. Stopping processing."
    Exit Sub
End If
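One way to make the retry explicit instead of bouncing between the error handler and Resume Next is a small polling helper. This is only a sketch: WaitForElement is a hypothetical name, robot is assumed to be the same SeleniumBasic driver used above, and the 20-second window mirrors the timeout in the question.

' Polls for an element by id; returns it, or Nothing once timeoutSecs has elapsed
Private Function WaitForElement(robot As Object, id As String, timeoutSecs As Long) As Object
    Dim deadline As Date
    deadline = Now + timeoutSecs / 86400# ' seconds expressed as a fraction of a day
    Do
        On Error Resume Next
        Set WaitForElement = robot.FindElementById(id)
        On Error GoTo 0
        If Not WaitForElement Is Nothing Then Exit Function
        Application.Wait Now + TimeValue("00:00:01") ' same one-second pause as the error handler
        DoEvents
    Loop Until Now > deadline
End Function

It would be called in place of the bare FindElementById:

Dim el As Object
Set el = WaitForElement(robot, "X_SUBJECT_ALTERNATE_NM.X_ALTERNATE_NM", 20)
If el Is Nothing Then
    MsgBox "Reached 20 second timeout. Stopping processing."
    Exit Sub
End If
el.SendKeys AliasDisplayName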
I understand that similar questions have been answered, but I am not sure whether I failed to understand how to apply the solutions from other people's answers, or whether the website I need to get the information from is just complex. So, please help me.
I would like to get the description field from Delphi for PN#13511996; the value should be "3 Way Gray GT 150 Sealed Female Connector Assembly, Max Current 15 amps". Could someone help me examine the website and let me know how to get the description?
Sub GetData()
    'Added Microsoft HTML Object Library reference
    'Added Microsoft XML, v6.0 reference
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim desc As String

    Set xhr = New MSXML2.XMLHTTP60
    With xhr
        .Open "GET", "http://ecat.delphi.com/feature?search=13511996", False
        .send
        If .ReadyState = 4 And .Status = 200 Then
            Set doc = New MSHTML.HTMLDocument
            doc.body.innerHTML = .responseText
        End If
    End With

    With doc
        desc = .getElementsByClassName("ProductDetail.Description").Item(0).innerText
    End With
    Debug.Print desc
End Sub
This is because you are requesting the raw HTML by using GET from XMLHTTP. If you try to Debug.Print doc.body.innerHTML, you will see that the table has not been generated yet, and the text you are looking for is not there at all.
To run the query for item "13511996", you need a real browser; only then is the table generated and a DOM document object available. Try the following code:
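A quick way to confirm this, reusing the xhr object from the question's code (purely a diagnostic sketch):

'The response itself arrives fine; the description text simply is not in the raw HTML
Debug.Print Len(xhr.responseText) 'non-zero: HTML was returned
Debug.Print InStr(xhr.responseText, "Sealed Female Connector") '0: the text is generated later by JavaScript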
Sub GetData()
    Dim aIE As InternetExplorer
    Dim desc As IHTMLElement

    Set aIE = New InternetExplorer
    With aIE
        .navigate "http://ecat.delphi.com/feature?search=13511996"
        .Visible = True '----> set it to False if you don't want to see the browser
    End With

    Do While (aIE.Busy Or aIE.ReadyState <> READYSTATE_COMPLETE)
        DoEvents
    Loop

    Set desc = aIE.document.getElementsByClassName("DetailAttributes")(0)
    'Debug.Print desc.innerText '---> prints the whole table data
    Debug.Print Split(desc.innerText, vbLf)(3) '----> prints the fourth item in the table

    Set aIE = Nothing
    Set desc = Nothing
End Sub
Also, if you plan to run this code in a loop for multiple queries, you might want to use:
Set desc = Nothing
For i = 1 To 100
    On Error Resume Next
    Set desc = aIE.document.getElementsByClassName("DetailAttributes")(0)
    If Err.Number = 91 Then
        GoTo Skip
    End If
    Exit For
Skip:
    Application.Wait (Now() + TimeValue("00:00:01"))
Next i
instead of:
Set desc = aIE.document.getElementsByClassName("DetailAttributes")(0)
This is because sometimes a web page seems ready before it has fully generated its contents. That lets the code exit the Do loop and proceed to the next statement, which sets the desc object. You won't get an error while setting it, because the code will be using the previous DOM document object and will output the results of your previous query, which is a bug. Without any errors, your code will run the loop to the end, and you will end up with completely scrambled output, which is a waste of time.
To work around this problem, you should set the object to Nothing beforehand, then catch the error and wait for the page to load inside the For loop.
Last but not least, if the people who built the web page you are parsing know what they are doing, they will probably protect it from repeated queries from the same source (and most likely from multiple sources as well), since without that protection such queries could overwhelm their server. This protection will show up on your side as a limited number of queries allowed within a limited amount of time; for example, after 100 requests within 5 minutes, the page might stop responding for some time (say, 2 minutes).
To work around this problem, you should limit the number of requests and wait for the required time. Suppose you increment your loop with the i variable; then you need to insert this at the end of your loop:
If i Mod 100 = 0 Then
    Application.Wait (Now() + TimeValue("00:02:00"))
End If
I hope the above solutions solve everyone's past and future problems; they took me a considerable amount of time to figure out.
We have an Excel list of URLs with a lot of parameters.
The problem is: the first time you follow a link, you get redirected to an ADFS login, which cuts off some of the parameters, since URLs have a maximum length.
My question: is there a possibility to tell Excel (be it via VBA or a default setting) to use an existing session?
I tried some shenanigans, for example via Chrome: finding the window handle for a Chrome browser, or taking over an existing IE window: http://www.mrexcel.com/forum/excel-questions/553580-visual-basic-applications-macro-already-open-ie-window.html While I do get an existing window, it seems it always gets redirected and the URL cut. Is there any way to make this work?
Please try this and post feedback:
Open Sheet1
In Column A, from row 2, create your list of URLs
Insert the ActiveX control Microsoft Web Browser (WebBrowser1)
Size the control to your needs
Insert a command button outside the bounds of the browser
Change the name of the button to NextButton
Open the code editor (Alt+F11)
In Sheet1, place the code below
Dim currentURLRow As Integer ''Sheet-level variable

Sub NextButton_Click()
    On Error Resume Next
    Dim url As String
    ''VBA evaluates the second expression of an Or even when the first is True, so On Error Resume Next helps here
    If currentURLRow = 0 Or Trim(Cells(currentURLRow, 1)) = "" Then
        ''First time, or loop back to the top of the list
        currentURLRow = 2
    Else
        currentURLRow = currentURLRow + 1
    End If
    On Error GoTo 0 ''reset error handling so you know of any (genuine) errors
    url = Cells(currentURLRow, 1)
    ''Sheet1.WebBrowser1.Silent = True ''Uncomment this if you are seeing lots of script errors that you don't want to see
    WebBrowser1.Navigate url
    Debug.Print WebBrowser1.Document.body.innerHTML ''' Here you can do magic if the urls you are navigating are serialisable to objects :)
End Sub
Now, the first time you navigate to the site, you should be prompted for a user name and password; after clicking Next, your session is saved.
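One caveat: Navigate returns immediately, so the Debug.Print in the sketch above can run before the page (and the ADFS redirect) has finished loading. A hedged alternative, under the same Sheet1/WebBrowser1 setup, is to read the document in the control's DocumentComplete event instead:

Private Sub WebBrowser1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    ''Fires when a document finishes loading (on framed pages this can fire once per frame)
    Debug.Print WebBrowser1.Document.body.innerHTML
End Sub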
The following Excel macro, which makes an xmlhttp request to this webpage to retrieve some values at a second stage, worked normally in VBA until some time ago:
Sub WebReq()
    Link = "http://it.finance.yahoo.com/q?s=^FCHI&ql=10" & Str(Rnd())
    Set htm = CreateObject("htmlFile")
    Set RequestWeb = CreateObject("msxml2.xmlhttp")
    With RequestWeb
        .Open "GET", "" & Link & "", False
        .send
        htm.body.innerhtml = .responsetext
    End With
End Sub
Now, instead, at the call of the method:
.send
of the msxml2.xmlhttp object, the following error is raised:
Run-time error '-2147024891 (80070005)'
Access is denied.
I've been looking on the web, but all the similar threads are never answered. Can anyone explain to me what this error means, and whether there's any way I could fix it, or even just work around it?
Note: the random string at the end of the 'Link' variable was added to force the page to reload, since the script retrieves real-time values, so the page should be reloaded every time.
Additional information: while looking for a solution, I noticed that the random part of the link yields the same value every time, even when I stop the run and restart:
Link = http://it.finance.yahoo.com/q?s=^FCHI&ql=10 .7055475
Why is this happening? Shouldn't Rnd() yield a new random value between 0 and 1 on every call?
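On the Rnd question: this is expected VBA behavior, unrelated to the access-denied error. Rnd returns the same pseudo-random sequence every time the project restarts unless it is seeded first, so calling Randomize once before building the link fixes it (a sketch against the code above):

Randomize 'seed Rnd from the system timer so each run gets a fresh sequence
Link = "http://it.finance.yahoo.com/q?s=^FCHI&ql=10" & Str(Rnd())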
Use
CreateObject("MSXML2.ServerXMLHTTP.6.0")
A standard request fired from a local machine is denied access to sites that aren't trusted by IE. MSXML2.ServerXMLHTTP.6.0 is the server-side object, which doesn't perform those checks.
I found that, in my case, changing http to https fixed the access-denied problem. I can only assume that the website made a change at some point and didn't tell anyone.
Access denied is an IE issue:
Internet Options > Security tab > Custom security level > Miscellaneous > Access data sources across domains > Enable
Update
Sub WebReq()
    link = "http://it.finance.yahoo.com/q?s=^FCHI&ql=10" & Str(Rnd())
    Set htm = CreateObject("htmlFile")
    Dim objHttp
    Set objHttp = CreateObject("Msxml2.ServerXMLHTTP")
    objHttp.Open "GET", link, False
    objHttp.Send
    htm.body.innerhtml = objHttp.responsetext
    Set objHttp = Nothing
End Sub
This works for me:
With CreateObject("MSXML2.ServerXMLHTTP.6.0")
    .Open "GET", URL, False
    .Send
    content = .ResponseText
End With
In my case the user didn't have AD permissions for the proxy server on our corporate network (a simple oversight when setting up the user). Adding the missing security group fixed the problem for the user.
I was able to fix it by changing the link being passed from "http://" to "https://".
The site I was pulling from had upgraded, and trying to pull the data over the unsecured link was failing. Works great now (no code change required).
add "www" after "https://" in your custom Link,
Like this:
XMLPage.Open "GET", "https://www.x-rates.com/table/?from=GBP&amount=3", False
XMLPage.send
I'm afraid I don't understand exactly why this problem occurs, but I'm guessing it is the secure "https://" versus the insecure "http://". I ran into the same "access denied" message while following sample code from a VBA course. The original code was:
XMLPage.Open "GET", "http://x-rates.com/table/?from=GBP&amount=3", False
XMLPage.send
I changed "http://" to "https://" and the error went away.
I am writing a macro that scrapes my company's internal SAP site for vendor information. For several reasons I have to use VBA to do so. However, I cannot figure out why I keep getting these three errors when I attempt to scrape the page. Is it possible that this has something to do with the UAC integrity model? Or is there something wrong with my code? Is it possible that a webpage served over HTTP is handled differently by Internet Explorer? I am able to go to any other webpage, even other internal ones, and scrape each of those just fine. But when I attempt to scrape the SAP page, I get these errors. The error descriptions and when they occur are:
800706B5 - The interface is unknown (occurs when I place breakpoints before running the offending code)
80004005 - Unspecified error (occurs when I don't place any errors and just let the macro run)
80010108 - The object invoked has disconnected from its clients. (I can't get this error to occur consistently; it seems to happen around the time that something in Excel is so corrupted that no page will load and I have to reinstall Excel)
I have absolutely no idea what is going on. The integrity-model page didn't make much sense to me, and all the research I found on this talked about connecting to databases and using ADO and COM references. However, I am doing everything through Internet Explorer. Here is my relevant code:
Private Sub runTest_Click()
    ie.Visible = True
    doScrape
End Sub

'The code to run the module
Private Sub doTest()
    Dim result As String
    result = PageScraper.scrapeSAPPage("<some num>")
End Sub

PageScraper Module

Public Function scrapeSAPPage(num As Long) As String
    'Predefined URL that appends num onto the end to navigate to a specific record in SAP
    Dim url As String: url = "<url here>"
    Dim ie As InternetExplorer
    Set ie = CreateObject("internetexplorer.application")
    Dim doc As HTMLDocument
    ie.navigate url 'Will always successfully open the page, SAP or otherwise
    'Pauses execution until the webpage has loaded
    Do
        'Will always fail on the next line when attempting the SAP site, with the errors above
        If Not ie.Busy And ie.ReadyState = 4 Then
            Application.Wait (Now + TimeValue("00:00:01"))
            If Not ie.Busy And ie.ReadyState = 4 Then
                Exit Do
            End If
        End If
        DoEvents
    Loop
    Set doc = ie.document 'After implementing Tim Williams's changes, breaks here
    'Scraping code here, not relevant
End Function
I am using IE9 and Excel 2010 on a Windows 7 machine. Any help or insight you can provide would be greatly appreciated. Thank you.
I do this type of scraping frequently and have found it very difficult to make IE automation work 100% reliably with errors like those you have found. As they are often timing issues, they can be very frustrating to debug: they don't appear when you step through, only during live runs. To minimize the errors, I do the following (a sketch of the first two points follows below):
Introduce more delays; ie.Busy and ie.ReadyState don't necessarily give valid answers IMMEDIATELY after an ie.navigate, so introduce a short delay after ie.navigate. For the things I'm loading, 1 to 2 seconds is normal, but anything over 500 ms seems to work.
Make sure IE is in a clean state by navigating to "about:blank" before going to the target URL.
After that you should have a valid IE object, and you'll have to look at it to see what you've got inside. Generally I avoid trying to access the entire ie.document and instead use IE.document.all.tags("x"), where "x" is a suitable element I'm looking for, such as td or a.
However, although all these improvements have increased my success rate, I still get errors at random.
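A rough sketch of those first two points, using the same ie object and url as in the question (the timings are illustrative, not magic numbers):

ie.navigate "about:blank" 'reset IE to a clean state first
Do While ie.Busy Or ie.ReadyState <> 4
    DoEvents
Loop

ie.navigate url
Application.Wait Now + TimeValue("00:00:02") 'Busy/ReadyState can report stale values right after navigate
Do While ie.Busy Or ie.ReadyState <> 4
    DoEvents
Loop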
My real solution has been to abandon IE and instead do my work using xmlhttp.
If you are parsing out your data using text operations on the document, swapping over will be a no-brainer: the xmlhttp object is MUCH more reliable, and you just read "responsetext" to access the entire HTML of the document.
Here is a simplified version of what I'm using in production now for scraping; it's so reliable it runs overnight generating millions of rows without error.
Public Sub Main()
    Dim obj As MSXML2.ServerXMLHTTP
    Dim strData As String
    Dim errCount As Integer

    ' Create an xmlhttp object - you will need a reference to the MS XML HTTP library, any version will do,
    ' but I'm using Microsoft XML, v6.0 (c:\windows\system32\msxml6.dll)
    Set obj = New MSXML2.ServerXMLHTTP

    ' Get the url - I set the last param to Async=True so that it returns right away, then let me wait in
    ' code rather than trust it, but on an internal network False might be better for you.
    obj.Open "GET", "http://www.google.com", True
    obj.send ' this line actually does the HTTP GET

    ' Wait for completion, up to 10 seconds
    errCount = 0
    While obj.readyState < 4 And errCount < 10
        DoEvents
        obj.waitForResponse 1 ' this is an up-to-one-second delay
        errCount = errCount + 1
    Wend

    If obj.readyState = 4 Then   ' I do these on two
        If obj.Status = 200 Then ' different lines to avoid certain error cases
            strData = obj.responseText
        End If
    End If

    ' In real code I use some On Error Resume Next, so at this point it is possible I have a failed
    ' GET, and so it is best to abort it before I try again
    obj.abort

    Debug.Print strData
End Sub
Hope that helps.