Scrape html data Vba - vba

I want to make a function that extracts data from a part of a site.
The following is the HTML site. HTML code.
Code for the function
Function GetElementById(url As String, id As String, Optional isVolatile As Boolean)
    Application.Volatile (isVolatile)
    On Error Resume Next ' any failure leaves the function returning Empty
    Dim html As Object, objResult As Object
    Dim ret As String
    ret = GetPageContent(url) ' GetPageContent is a helper that is not shown here
    Set html = CreateObject("htmlfile")
    html.Body.innerHtml = ret
    Set objResult = html.GetElementById(id)
    GetElementById = objResult.innerHtml
End Function
I need it to extract only the element with the class "panel-body",
directly inside the function. I think it would be .Children(3). Is that correct?
It also needs to be practical and fast, because I need to extract data from more than 50 sites.

I see at least two options.
First, once you have the HTMLDivElement with id=Result, you could simply walk its children. Test this by first evaluating objResult.Children(2) and checking which element is returned.
objResult.Children(2).Children(0).Children(0)
Second, in later versions of MSHTML (I think with IE9 or later installed) there is a "GetElementsByClassName" method, which returns a collection of IHTMLElements. If the HTMLDocument has only one "panel-body" element, you are in luck. If not, you would need to loop through the collection and check some other unique feature to know you have the right one.
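The class-based lookup could be sketched as follows. This is a minimal, untested sketch: GetFirstDivByClass is a hypothetical helper name, it assumes the target element is a div, and it loops over getElementsByTagName and compares className instead of calling GetElementsByClassName, so it does not depend on the installed MSHTML version.

```vba
' Hypothetical helper: returns the innerHTML of the first <div> whose class
' list contains className, or "" if no such <div> exists.
Function GetFirstDivByClass(pageHtml As String, className As String) As String
    Dim doc As Object, div As Object
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = pageHtml
    For Each div In doc.getElementsByTagName("div")
        ' Pad with spaces so "panel-body" does not match "panel-body-wide"
        If InStr(1, " " & div.className & " ", " " & className & " ", vbTextCompare) > 0 Then
            GetFirstDivByClass = div.innerHTML
            Exit Function
        End If
    Next div
End Function
```

Called with the page source and "panel-body", it returns the inner HTML of the first match, so the same function can be reused for each of the 50 pages.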

Another way to generate the code for this job is to record a macro, then wrap the recorded macro in a loop that goes through your 50 pages and gets the results.
On the Data tab of the ribbon there is an option to get data from external sources. It gives you a point-and-click interface that lets you choose the table you're looking for. Record a macro while you're doing this and it generates the code for you.

Related

Saving Custom Document Properties in a Loop

I'm trying to save the values of data that have been input into my form. There are a total of about 50 different fields to save across 5 different agents, so I loaded the data into arrays.
I've tried saving the fields in a loop, but it doesn't seem to work in a loop, only if each field has a separate line, which is a lot of code and messy. The Ag1Name, Ag2Name and Ag3Name are the names of my textboxes that the user enters to populate the form.
Sub LoadAndSaveData()
    NumberofAgents = 3
    Dim AgentName(3) as String
    AgentName(1) = Ag1Name.Value
    AgentName(2) = Ag2Name.Value
    AgentName(3) = Ag3Name.Value
    For Count = 1 To NumberOfAgents
        With ActiveDocument.CustomDocumentProperties
            .Add Name:="AgentName" & Count, LinkToContent:=False, Value:=AgentName(Count), Type:=msoPropertyTypeString
        End With
    Next Count
End Sub
The data doesn't get saved to the Custom Document Properties when the code is set up in a loop like the above. Since there are so many values to save and all the data is already in arrays, I would much prefer to use a loop rather than write out a separate line of code for all ~50 of the values. It does seem to work when each field is saved in a separate line of code.
I think this would probably get what you want. You don't really need to count the document properties first, only increment with the ones you want to update. Hopefully the only document properties you want contain the name AgentName in it.
ReDim AgentName(0) As String
Dim P As Long
For Each c In ThisDocument.CustomDocumentProperties
    If InStr(1, c.Name, "AgentName", vbTextCompare) > 0 Then
        ReDim Preserve AgentName(P)
        AgentName(P) = c.Value
        P = P + 1
    End If
Next c
As a guest I cannot post a comment here, but the code you gave works OK here.
However, there is a problem with creating legacy custom document properties programmatically, because doing that does not mark the document as "changed". When you close the document, Word does not necessarily save it and you lose the Properties and their values.
However, if you actually open up the Custom Document Property dialog, Word does then mark the document as "changed" and the Properties are saved.
So it is possible that the difference between your two scenarios is not the code, but that in one scenario you have actually opened the dialog box to check the values before closing the document and in the other you have not.
If that is the case, here, I was able to change this behaviour by adding the line
ActiveDocument.Saved = False
after setting the property values.
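Combined with the loop from the question, that might look like the sketch below. SaveAgentProperties is a hypothetical name, and deleting an existing property before re-adding it is my own assumption about what you want on repeated runs (Add raises an error if a property with that name already exists).

```vba
' Sketch: write each array element to a custom document property,
' then mark the document as changed so Word saves the properties.
Sub SaveAgentProperties(agentNames() As String)
    Dim props As Object
    Dim i As Long
    Set props = ActiveDocument.CustomDocumentProperties
    For i = LBound(agentNames) To UBound(agentNames)
        ' Remove a property left over from an earlier run, if any
        On Error Resume Next
        props("AgentName" & i).Delete
        On Error GoTo 0
        props.Add Name:="AgentName" & i, LinkToContent:=False, _
                  Type:=msoPropertyTypeString, Value:=agentNames(i)
    Next i
    ActiveDocument.Saved = False
End Sub
```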
If you do not actually need the values to be Document Properties, it might be better either to use Document Variables, which are slightly easier to use since you can add them and modify them with exactly the same code, or perhaps by storing them in A Custom XML Part, which is harder work but can be useful if you need to extract the values somewhere where Word is not available.
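As a rough illustration of the Document Variables route, the same assignment both creates and updates a variable. SaveAgentNames and the sample names are placeholders of mine, and setting Saved = False here is a precaution rather than something I have verified to be necessary.

```vba
' Sketch: store values as document variables instead of custom properties.
Sub SaveAgentNames()
    Dim agentNames(1 To 3) As String
    Dim i As Long
    agentNames(1) = "Alice"   ' placeholder values; in the form these would
    agentNames(2) = "Bob"     ' come from Ag1Name.Value and so on
    agentNames(3) = "Carol"
    For i = 1 To 3
        ' Assigning .Value adds the variable if it does not exist yet
        ActiveDocument.Variables("AgentName" & i).Value = agentNames(i)
    Next i
    ActiveDocument.Saved = False
End Sub
```

Note that the values must be non-empty strings; Word does not keep a document variable whose value is an empty string.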
You can make this even easier by looping the controls on the UserForm, testing whether the control name contains "Ag" and, if it does, create the Custom Document Property with the control's value - all in one step.
For example, the following code sample loops through the controls in the UserForm and tests whether each control's Name starts with "Ag". If it does, a CustomDocumentProperty is added with that control's value.
Sub LoadAndSaveData()
    Dim ctl As MSForms.control
    Dim controlName As String
    For Each ctl In Me.Controls
        controlName = ctl.Name
        If Left(controlName, 2) = "Ag" Then
            With ActiveDocument.CustomDocumentProperties
                .Add Name:=controlName, LinkToContent:=False, value:=ctl.value, Type:=msoPropertyTypeString
            End With
        End If
    Next
End Sub
I feel a little stupid... I just realized that the reason that the code wasn't working was that the variable NumberofAgents was not being calculated correctly elsewhere in my code. I've got it working now. Thanks for your thoughts!

How to get the last child of an HTMLElement

I have written a macro in Excel that opens and parses a website and pulls the data from it. The trouble I'm having is once I'm done with all of the data on the current page I want to go to the next page. To do this I want to get the last child of the "result-stats" node. I found the lastChild function, and so came up with the following code:
'Checks to see if there is a next page
If html.getElementById("result-stats").LastChild.innerText = "Next" Then
html.getElementById("result-stats").LastChild.Click
End If
And here is the HTML that it is accessing:
<p id="result-stats">
949 results
<span class="optional"> (1.06 seconds)</span>
Modify search
Show more columns
Next
</p>
When I try to run this, I get an error. After a lot of searching I think I found the reason. According to what I read, getElementById returns an element and not a node. lastChild only works on nodes, which is why the function doesn't work here.
My question is this. Is there a clean and simple way to grab the last child of an element? Or is there a way to typecast an element to that of a node? I feel like I'm missing something obvious, but I've been at this way longer than I should have been. Any help anyone could provide would be greatly appreciated.
Thanks.
Here's a shell of how to do it. If my comments are not clear, ask away. I assumed knowledge of how to navigate to the page, wait for the browser, etc.
Sub ClickLink()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    'load up page and all that stuff
    'process data ...
    'click link
    Dim doc As Object
    Set doc = IE.document
    Dim sLink As Object
    For Each sLink In doc.getElementsByTagName("a")
        If sLink.innerText = "Next" Then 'may need to play with this if innerText doesn't work
            sLink.Click
            Exit For
        End If
    Next
End Sub
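If you do want the last child element directly, one option is the Children collection, which contains only elements, whereas LastChild can also return a text node (for example trailing whitespace) that has no innerText or Click. A minimal sketch, where LastElementChild is a hypothetical helper name:

```vba
' Sketch: return the last child *element* of parent, or Nothing if it has none.
Function LastElementChild(parent As Object) As Object
    Dim n As Long
    n = parent.Children.Length
    If n > 0 Then Set LastElementChild = parent.Children(n - 1)
End Function
```

With that in place, the check from the question could read LastElementChild(html.getElementById("result-stats")).innerText = "Next", assuming "Next" is an element such as a link rather than bare text.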

Opening multiple webbrowser windows in a loop.

I have an Excel spreadsheet and use VBA to code functions. A user enters X entries into the spreadsheet, and I want to open X web browsers, one for each entry. The web browsers will go to specific websites. The code I have now works for one entry, where I use Webbrowser1.Navigate to go to "CX". What I want, if there are 5 entries, is for Webbrowser2 through Webbrowser5 to navigate to "CX" as well. Is there a way to have a dynamic web browser?
Since you've not included any code, it's a little difficult to see what you're doing so far.
If you add the "Microsoft Internet Controls" reference to VBA, you can create an array of browsers. You can do it without adding the reference by using CreateObject, but you won't get any IntelliSense help in the editor, so it might be harder if you don't know the control's methods etc.
' Create a large array of them and initialise/destroy them as needed
Dim Browser(1 to 10) As InternetExplorer
' Init browser1
Set Browser(1) = New InternetExplorer
' Destroy browser1
Set Browser(1) = Nothing
Is there any reason you can't just use one or two browsers and load the pages one after the other? Depending on what your goal is, opening many browsers might not help that much.
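For a number of entries that is only known at run time, the array can be sized dynamically. The sketch below uses late binding (CreateObject) so it needs no reference; OpenBrowsers is a hypothetical name and the URLs passed in are whatever your sheet supplies.

```vba
' Sketch: open one Internet Explorer window per URL and return the browser objects.
Function OpenBrowsers(urls As Variant) As Variant
    Dim browsers() As Object
    Dim i As Long
    ReDim browsers(LBound(urls) To UBound(urls))
    For i = LBound(urls) To UBound(urls)
        Set browsers(i) = CreateObject("InternetExplorer.Application")
        browsers(i).Visible = True
        browsers(i).Navigate CStr(urls(i))
    Next i
    OpenBrowsers = browsers
End Function
```

For example, wins = OpenBrowsers(Array("https://example.com/1", "https://example.com/2")) would open two windows; call wins(i).Quit on each when you are done.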

Execute a user-defined function into another cell VBA Excel

I need to automate a process that executes a large number of user-defined functions with different input parameters.
I am using the timer API solution found in "I don't want my Excel Add-In to return an array (instead I need a UDF to change other cells)".
My question is the following: can anybody explain to me HOW IT WORKS? If I debug this code in order to understand it and change what I need, I simply go crazy.
1) Let's say that I pass 14 and 45 to the public function AddTwoNumbers. Inside AddTwoNumbers, Application.Caller and Application.Caller.Address are cached in a collection (OK, easier than vectors, so as not to bother with types). Application.Caller is a kind of structured object in which I can find the called function as a string (in this case "my_function"), for example in Application.Caller.Formula.
!!! Nowhere in the collection mCalculatedCells can I find the result 59 stored.
2) OK, fair enough. Now I pass through the two UDF routines, set the timers, kill the timers.
As soon as I am inside the AfterUDFRoutine2 sub, mCalculatedCells(1) (the first, and sole, item of my collection) has MAGICALLY (??!?!?!!!??) obtained exactly the result "59" in its Text field, and apparently the command Set Cell = mCalculatedCells(1) (where on the left I have a Range and on the right I have ... I don't know) is able to put this result "59" into the variable Cell, which I can afterwards write to the cell on the right with the .Offset(0,1) Range property.
I would like to understand this point because I would like to put MORE tasks inside a single collection, or be able to wait for the current task to finish before asking for a new one (otherwise I overwrite the 59 with the other result). Indeed, I read somewhere that all the tasks scheduled with the SetTimer API wait for all the callbacks to be executed before executing themselves (or something like this).
As you can see I am at a loss. Any help would be really really welcomed.
In the following I try to be more specific about what (as a whole) I am planning to achieve.
I have the function
Public Function my_function(input1 As String, Field2 As String) As Double
    ' some code
End Function
I have (let's say) 10 different strings to be given as Field2.
My strategy is as follows:
1) define (with an xlw wrapper from C++ code) the grid of all my input values
2) define as strings all the "my_function" calls with the various inputs
3) use the nested API timer as in the link to write my functions IN THE RIGHT CELLS as FORMULAS (not strings anymore)
4) use a macro to build the entire worksheet and then retrieve the results
5) use my xlw wrapper xll to process my data further.
You may wonder WHY should I pass through Excel instead of doing everything in C++. (Sometime I ask myself the same thing...) The prototype my_function that I gave above has inside some Excel Add-In that I need to use and they work only inside Excel.
It is working pretty well IN THE CASE I HAVE ONLY 1 "instance" of my_function to write for the give grid of input. I can even put inside the same worksheet more grids, then use the API trick to write various different my_functions for the different grids and then make a full calculation rebuild of the entire worksheet to obtain the result. It works.
However, as soon as I want to give more tasks inside the same API trick (because for the same grid of input I need more calls to my_function) I am not able to proceed any further.
After Axel Richter's comment I would like to add some more information.
@Axel Richter
Thank you very much for your reply.
Sorry for that, almost surely I wasn't clear with my purposes.
Here I try to sketch an example, I use integer for simplicity and let's say that my_function works pretty much as the SUM function of Excel (even if being an Excel native function I could call SUM directly into VBA but it is for the sake of an example).
If I have these inputs:
input1 = "14.5"
a vector of different values for Field2, for instance (11;0.52;45139)
and then I want to write somewhere my_function (which makes the sum of the two values given as input).
I have to write down in a cell =my_function(14.5;11), in the other =my_function(14.5;0.52) and in a third one =my_function(14.5;45139).
These inputs change every time I need to refresh my data, so I cannot use a sub directly (I think). In any case, as far as I understand, writing directly without the trick I linked always gives me strings, something like '=my_function(14.5;0.52). Once evaluated (for example by a full rebuild, or by selecting the written cell and pressing F2 + Return), this gives me only the string "=my_function(14.5;0.52)" and not its result.
I tried at the beginning to use an Evaluate method which works well as soon as I write something like 14.5+0.52, but it doesn't work as soon as a function (nor a user-defined function) is used instead.
This is "as far as I can understand". In the case you can enlighten me (and maybe show an easier track to follow), it would be simply GREAT.
So far the comments are correct in that they repeat the simple point that a User-Defined Function called from a worksheet can only return a value, and all other actions that might inject values elsewhere into the worksheet calculation tree are forbidden.
That's not the end of the story. You'll notice that there are add-ins, like the Reuters Eikon market data service and Bloomberg for Excel, that provide functions which exist in a single cell but which write blocks of data onto the sheet outside the calling cell.
These functions use the RTD (Real Time Data) API, which is documented on MSDN:
How to create a RTD server for Excel
How to set up and use the RTD function in Excel
You may find this link useful, too:
Excel RTD Servers: Minimal C# Implementation
However, RTD is all about COM servers outside Excel.exe, you have to write them in another language (usually C# or C++), and that isn't the question you asked: you want to do this in VBA.
But I have, at least, made a token effort to give the 'right' answer.
Now for the 'wrong' answer, and actually doing something Microsoft would rather you didn't do. You can't just call a function, call a subroutine or method from the function, and write to the secondary target using the subroutine: Excel will follow the chain and detect that you're injecting values into the sheet calculation, and the write will fail.
You have to insert a break into that chain; and this means using events, or a timer call, or (as in RTD) an external process.
I can suggest two methods that will actually work:
1: Monitor the cell in the Worksheet_Change event:
Private Sub Worksheet_Change(ByVal Target As Range)
    Dim strFunc As String
    strFunc = "NukeThePrimaryTargets"
    ' A formula starts with "=", so compare against "=" & strFunc
    If Left(Target.Formula, Len(strFunc) + 1) = "=" & strFunc Then
        Call NukeTheSecondaryTargets
    End If
End Sub
Alternatively...
2: Use the Timer callback API:
However, I'm not posting code for that: it's complex, clunky, and it takes a lot of testing (so I'd end up posting untested code on StackOverflow). But it does actually work.
I can give you an example of a tested Timer Callback in VBA:
Using the VBA InputBox for passwords and hiding the user's keyboard input with asterisks.
But this is for an unrelated task. Feel free to adapt it if you wish.
Edited with following requirements: It is necessary to run a user defined worksheet function, because there are addins called in this function and those work only within a Excel sheet. The function has to run multiple times with different parameters and its results have to be gotten from the sheet.
So this is my solution now:
Public Function my_function(input1 As Double, input2 As Double) As Double
    my_function = input1 + input2
End Function
Private Function getMy_Function_Results(input1 As Double, input2() As Double) As Variant
    Dim results() As Double
    'set the Formulas
    With Worksheets(1)
        For i = LBound(input2) To UBound(input2)
            strFormula = "=my_function(" & Str(input1) & ", " & Str(input2(i)) & ")"
            .Cells(1, i + 1).Formula = strFormula
        Next
        'get the Results
        .Calculate
        For i = LBound(input2) To UBound(input2)
            ReDim Preserve results(i)
            results(i) = .Cells(1, i + 1).Value
        Next
    End With
    getMy_Function_Results = results
End Function
Sub test()
    Dim dFieldInput2() As Double
    Dim dInput1 As Double
    dInput1 = Val(InputBox("Value for input1"))
    dInput = 0
    iIter = 0
    Do
        dInput = InputBox("Values for fieldInput2; 0=END")
        If Val(dInput) <> 0 Then
            ReDim Preserve dFieldInput2(iIter)
            dFieldInput2(iIter) = Val(dInput)
            iIter = iIter + 1
        End If
    Loop While Val(dInput) <> 0 'Val avoids a type mismatch on empty or non-numeric input
    On Error GoTo noFieldInput2
    i = UBound(dFieldInput2)
    On Error GoTo 0
    vResults = getMy_Function_Results(dInput1, dFieldInput2)
    For i = LBound(vResults) To UBound(vResults)
        MsgBox vResults(i)
    Next
noFieldInput2:
End Sub
The user can input first a value input1 and then input multiple fieldInput2 until he inputs the value 0. Then the results will be calculated and presented.
Greetings
Axel

Get the number of pages in a Word document

I am making lots of changes to a Word document using automation, and then running a VBA macro which - among other things - checks that the document is no more than a certain number of pages.
I'm using ActiveDocument.Information(wdNumberOfPagesInDocument) to get the number of pages, but this method is returning an incorrect result. I think this is because Word has not yet updated the pagination of the document to reflect the changes that I've made.
ActiveDocument.ComputeStatistics(wdStatisticPages) also suffers from the same issue.
I've tried sticking in a call to ActiveDocument.Repaginate, but that makes no difference.
I did have some luck with adding a paragraph to the end of the document and then deleting it again - but that hack seems to no longer work (I've recently moved from Word 2003 to Word 2010).
Is there any way I can force Word to actually repaginate, and/or wait until the repagination is complete?
I just spent a good 2 hours trying to solve this, and I have yet to see this answer on any forum so I thought I would share it.
https://msdn.microsoft.com/en-us/vba/word-vba/articles/pages-object-word?f=255&MSPPError=-2147217396
That gave me my solution combined with combing through the articles to find that most of the solutions people reference are not supported in the newest versions of Word. I don't know what version it changed in, but my assumption is that 2013 and newer can use this code to count pages:
ActiveDocument.ActiveWindow.Panes(1).Pages.Count
I believe it works like this: ActiveDocument selects the file; ActiveWindow makes sure the current window is used (in case the file is open in multiple windows from the View tab); Panes(1) picks the "first" pane if there are multiple windows, split panes or any other nonsense; and Pages.Count counts the number of items in the Pages collection.
Anyone more knowledgeable feel free to correct me, but this is the first method that gave me the correct page count on any document I tried!
Also I apologize but I cant figure out how to format that line into a code block. If the mods want to edit my comment to do that be my guest.
Try (maybe after ActiveDocument.Repaginate)
ActiveDocument.BuiltinDocumentProperties(wdPropertyPages)
It makes my Word 2010 spend half a second showing "Counting words" in the status bar, whereas ActiveDocument.ComputeStatistics(wdStatisticPages) returns the result immediately.
Source: https://support.microsoft.com/en-us/kb/185509
After you've made all your changes, you can use OnTime to force a slight delay before reading the page statistics.
Application.OnTime When:=Now + TimeValue("00:00:02"), _
Name:="UpdateStats"
I would also update all the fields before this OnTime statement:
ActiveDocument.Range.Fields.Update
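Put together, the delayed read might look like the sketch below. UpdateStats is the macro name used in the OnTime call above; ScheduleStats is a hypothetical name of mine, and the two-second delay is simply the value from this answer, not a guaranteed upper bound on repagination time.

```vba
' Sketch: update fields, then read the page count after a short delay.
Sub ScheduleStats()
    ActiveDocument.Range.Fields.Update
    Application.OnTime When:=Now + TimeValue("00:00:02"), _
                       Name:="UpdateStats"
End Sub

' Runs about two seconds later; must be a public Sub in a standard module.
Sub UpdateStats()
    Debug.Print "Pages: " & ActiveDocument.ComputeStatistics(wdStatisticPages)
End Sub
```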
I found a possible workaround below, if not a real answer to the topic question.
Yesterday, the first ComputeStatistics line below was returning the correct total of 31 pages, but today it returns only 1.
The solution is to get rid of the Content object and the correct number of pages is returned.
Dim docMultiple As Document
Set docMultiple = ActiveDocument
lPageCount = docMultiple.Content.ComputeStatistics(wdStatisticPages) ' Returns 1
lPageCount = docMultiple.ComputeStatistics(wdStatisticPages) ' Returns correct count, 31
ActiveDocument.Range.Information(wdNumberOfPagesInDocument)
This works every time for me. It returns the total number of physical pages in the Word document.
I used the following from within Excel. It worked reliably on about 20 documents; none were longer than 20 pages, but some were quite complex, with images, page breaks etc.
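Sub GetlastPageFromInsideExcel()
    ' Late binding: Word's constants are not defined in Excel, so declare the one needed
    Const wdActiveEndPageNumber As Long = 3
    Dim wD As Object, myDoc As Object, lastPage As Long
    Set wD = CreateObject("Word.Application")
    Set myDoc = wD.Documents.Open("C:\Temp\mydocument.docx")
    myDoc.Characters.Last.Select ' move to end of document
    wD.Selection.Collapse ' collapse selection at end
    lastPage = wD.Selection.Information(wdActiveEndPageNumber)
    Debug.Print lastPage ' page number of the last page
    myDoc.Close
    wD.Quit
    Set wD = Nothing
End Sub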
One problem I had in getting "ComputeStatistics" to return a correct page count was that I often work in "Web Layout" view in Word. Whenever you start Word it reverts to the last view mode used. If Word was left in "Web Layout" mode "ComputeStatistics" returned a page count of "1" page for all files processed by the script. Once I specifically set "Print Layout" view I got the correct page counts.
For example:
$MSWord.activewindow.view.type = 3 # 3 is 'wdPrintView', 6 is 'wdWebView'
$Pages = $mydoc.ComputeStatistics(2) # 2 is 'wdStatisticPages'
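The same idea in VBA, as a minimal sketch: PageCountInPrintView is a hypothetical helper name, and the Repaginate call is an extra precaution of mine rather than part of this answer.

```vba
' Sketch: switch to Print Layout before asking for the page count.
Function PageCountInPrintView(doc As Document) As Long
    doc.ActiveWindow.View.Type = wdPrintView   ' 3; Web Layout (6) reports 1 page
    doc.Repaginate
    PageCountInPrintView = doc.ComputeStatistics(wdStatisticPages)
End Function
```

Usage would be something like n = PageCountInPrintView(ActiveDocument).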
You can use the Pages object and its properties, such as Count. It works perfectly ;)
Dim objPages As Pages
Set objPages = ActiveDocument.ActiveWindow.Panes(1).Pages
QuantityOfPages = objPages.Count
Dim wordapp As Object
Set wordapp = CreateObject("Word.Application")
Dim doc As Object
Set doc = wordapp.Documents.Open(oFile.Path)
Dim pagesCount As Integer
pagesCount = doc.Content.Information(4) 'wdNumberOfPagesInDocument
doc.Close False
Set doc = Nothing