Web-Crawler for VBA - vba

I am trying to program a web crawler using Visual Basic. I have a list of links stored in column 1 of an Excel sheet. The macro should open each link and add certain information from the website to the Excel file.
Here's the first link (stored in field A2).
The Macro should identify and insert the name of the hotel into column 2 (B2), the rating in column 3 (C2) and the address in column 4 (D2). This process could then be repeated with a loop for all other links (all websites have the same structure).
My code so far (I did not add the loop yet):
Sub Hoteldetails()
Dim IEexp As Object
Set IEexp = CreateObject("InternetExplorer.Application")
IEexp.Visible = True
Range("A2").Select
Selection.Hyperlinks(1).Follow NewWindow:=False, AddHistory:=True
End Sub
How can I "select" the specific data I want and insert it into the excel file? I tried to record the macro via "Add Data", but was not able to import the data from the website. I also tried to do it by using various example codes, but it did not work out for my specific website.
Thanks a lot for any assistance!

tl;dr;
I am not going to do all the work for you but this is fairly easy if the pages have the same structure.
You can issue a browserless XMLHTTP request, to get a nice fast response, and then select the items of interest using either id or classname and collection index.
Here is an example, using the link you provided, which you can adapt into a loop over all links.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, HTML As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.tripadvisor.co.uk/Hotel_Review-g198832-d236315-Reviews-Grand_Hotel_Kronenhof-Pontresina_Engadin_St_Moritz_Canton_of_Graubunden_Swiss_Alps.html", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
With HTML
.body.innerHTML = sResponse
Debug.Print "HotelName: " & .getElementById("HEADING").innerText
Debug.Print "Address: " & .getElementsByClassName("detail")(0).innerText
Debug.Print "Rating: " & .getElementsByClassName("overallRating")(0).innerText
End With
End Sub
References:
VBE > Tools > References > HTML Object Library
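A loop over all links could look roughly like the sketch below. It assumes (these are assumptions, not something stated in the question) that the links sit on the active sheet in column A starting at row 2, that each cell's text is the full URL, and that every page exposes the same id and class names as the example above. The procedure name GetInfoForAllLinks is made up for illustration.

```vba
Option Explicit

' Sketch only: loops over the URLs in column A and writes
' hotel name, rating and address into columns B, C and D.
' Requires the reference to the Microsoft HTML Object Library.
Public Sub GetInfoForAllLinks()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim sResponse As String
    Dim html As HTMLDocument
    Dim http As Object

    Set ws = ActiveSheet                          ' assumption: links are on the active sheet
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    Set http = CreateObject("MSXML2.XMLHTTP")

    For r = 2 To lastRow                          ' assumption: row 1 holds headers
        http.Open "GET", CStr(ws.Cells(r, 1).Value), False
        http.send
        sResponse = StrConv(http.responseBody, vbUnicode)

        Set html = New HTMLDocument
        html.body.innerHTML = sResponse

        ' If an expected element is missing, the cell is simply left empty
        On Error Resume Next
        ws.Cells(r, 2).Value = html.getElementById("HEADING").innerText
        ws.Cells(r, 3).Value = html.getElementsByClassName("overallRating")(0).innerText
        ws.Cells(r, 4).Value = html.getElementsByClassName("detail")(0).innerText
        On Error GoTo 0
    Next r
End Sub
```

If a cell holds a hyperlink whose display text differs from its target, ws.Cells(r, 1).Hyperlinks(1).Address would be the value to request instead.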

You have several options:
Option 1: IEObject
Use the getElementBy... methods of the IE object, then use string manipulation to extract the data you need. There are two options for the string extraction:
Extract a top-level element by name or by id, then use string manipulation functions such as Mid, InStr, Left and Right
Use Regex (the VBScript RegExp object, available from VBA) to extract the data (recommended)
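As a rough illustration of the Regex route, the helper below returns the first capture group of a pattern applied to an HTML string. The function name ExtractFirstMatch and the sample pattern in the usage line are made up for this sketch; only the VBScript.RegExp object itself is a given.

```vba
' Sketch only: returns the first capture group of the first match,
' or an empty string if the pattern does not match.
Public Function ExtractFirstMatch(ByVal html As String, ByVal pattern As String) As String
    Dim re As Object
    Dim matches As Object

    Set re = CreateObject("VBScript.RegExp")   ' late bound, so no extra reference is needed
    re.Pattern = pattern
    re.IgnoreCase = True
    re.Global = False                          ' the first match is enough

    Set matches = re.Execute(html)
    If matches.Count > 0 Then
        If matches(0).SubMatches.Count > 0 Then
            ExtractFirstMatch = matches(0).SubMatches(0)
        End If
    End If
End Function
```

It could be called as ExtractFirstMatch(sResponse, "<h1 id=""HEADING""[^>]*>([\s\S]*?)</h1>"), where sResponse holds the page source. [\s\S] is used instead of a dot because the dot in VBScript regular expressions does not match line breaks.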
Option 2: Scrape HTML Add-In
Some time ago I developed an add-in for Excel that allows you to easily scrape HTML data within an Excel formula. The process is similar to the one above, as you still need to create a relevant Regex. See an example below for TripAdvisor:
The formula in B2 looks like this (A2 is the link, and the second argument is the Regex):
=GetElementByRegex(A2;"<h1 id=""HEADING"".*?>(?:(?:.|\n)*?)</div>((?:.|\n)*?)</h1>")
You can download the AddIn here:
http://www.analystcave.com/excel-tools/excel-scrape-html-add/

Related

How to avoid runtime error 5152 when trying to add a URL as shape to a Word document

I am trying to place a QR code generated through an API (api.qrserver.com) in a Word table using VBA. For certain reasons, the option of simply using "DisplayBarcode" is not possible.
This is the call to the API:
sURL = "https://api.qrserver.com/v1/create-qr-code/?data=" & UTF8_URL_Encode(VBA.Replace(QR_Value, " ", "+")) & "&size=240x240"
It seems to work well. I tried with a GET command and retrieved a string that - as I interpret - contains the QR code in png format.
Now, when I try to add the picture as a shape using
Set objGrafik = ActiveDocument.Shapes.AddPicture(sURL, True)
the call fails with runtime error 5152. As far as I could determine so far, the AddPicture method expects a pure filename and does not allow any of the following characters: /|?*<>:".
I also tried to store the GET result in an object variable:
Set oQRCode = http.responseText
but there I get the error "object required".
Research on the internet regarding a solution to either make the URL assignment work or to store the result as a picture didn't retrieve any useful results. Thanks in advance for your support
I am not sure that any of the ways you could insert something into Word (e.g. Shapes.AddPicture, InlineShapes.AddPicture, Range.InsertFile etc.) will let you do that from any old https URL, although it seems to work for some URLs.
However, as it happens, you can use an INCLUDEPICTURE field to do it. For example
{ INCLUDEPICTURE https://api.qrserver.com/v1/create-qr-code/?data=Test&size=100x100 }
Here's some sample VBA to do that
Sub insertqrcode()
Dim sUrl As String
Dim f As Word.Field
Dim r1 As Word.Range
Dim r2 As Word.Range
' This is just a test - plug your own Url in.
sUrl = "https://api.qrserver.com/v1/create-qr-code/?data=abcdef&size=100x100"
' Pick an empty test cell in a table
Set r1 = ActiveDocument.Tables(1).Cell(5, 4).Range
' We have to remove the end of cell character to add the field to the range
' Simplify
Set r2 = r1.Duplicate
r2.Collapse WdCollapseDirection.wdCollapseStart
Set f = r2.Fields.Add(r2, WdFieldType.wdFieldIncludePicture, sUrl, False)
' If you don't want the field code any more, do this
f.Unlink
' check that we have a new inline shape that we could work with more
' if necessary
Debug.Print r1.InlineShapes.count
Set f = Nothing
Set r2 = Nothing
Set r1 = Nothing
End Sub
Using INCLUDEPICTURE works even on the current version of Mac Word (although I haven't tested that specific piece of VBA on Mac).
The only other way I have seen to do it uses WinHTTP to get the result from the web service, ADODB to stream it out to a local disk file, then AddPicture or whatever to include that file, so much more complicated and won't work on Mac either.
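For completeness, that download-then-insert route might look like the sketch below. The procedure name, the temp file name qrcode.png and the choice to insert at the current selection are assumptions made for illustration; it relies on WinHTTP and ADODB, so it is Windows only.

```vba
' Sketch only: downloads the PNG to a temp file, then inserts that file.
Sub InsertQrCodeViaTempFile()
    Const adTypeBinary As Long = 1
    Const adSaveCreateOverWrite As Long = 2
    Dim sUrl As String, sFile As String
    Dim http As Object, stm As Object

    sUrl = "https://api.qrserver.com/v1/create-qr-code/?data=abcdef&size=100x100"
    sFile = Environ$("TEMP") & "\qrcode.png"     ' hypothetical temp file name

    ' Fetch the image bytes
    Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
    http.Open "GET", sUrl, False
    http.Send
    If http.Status <> 200 Then
        MsgBox "Request failed with status " & http.Status
        Exit Sub
    End If

    ' Stream the binary response body out to disk
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeBinary
    stm.Open
    stm.Write http.ResponseBody
    stm.SaveToFile sFile, adSaveCreateOverWrite
    stm.Close

    ' AddPicture is given a plain local file name here
    ActiveDocument.InlineShapes.AddPicture FileName:=sFile, _
        LinkToFile:=False, SaveWithDocument:=True, Range:=Selection.Range
End Sub
```

To target a table cell instead of the selection, pass a collapsed cell range (as in the INCLUDEPICTURE example) as the Range argument.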

Copy specific field from different websites to worksheet

I need a VBScript that can copy output from different web pages into an Excel sheet.
Example:
A website like truecaller.com, where you can search for people by phone number.
Each number is represented by a unique web address (e.g. www.truecaller.com/au/439965324).
I need to make an Excel sheet that has two columns: the 1st one is the web address and the 2nd one is the related name.
Excel VBA is not the best for web scraping but it can get the job done.
Firstly you'll need to make sure you download the latest Internet Explorer, or at least ensure you have version 9 or above.
Secondly, you'll have to enable some references on your macros (these are analogous to imports in languages like Java). To do this, open your VBA editor, and go to Tools > References. You'll want to tick Microsoft Internet Controls and Microsoft HTML Object Library.
Now you're good to go; the following code should work for you. Not being a member of Truecaller, I only see "-" in the name field, but I imagine it's different if you have an account. The script I've made simply pulls out the name, number and address. I'm sure you won't have a problem looping through your desired URLs and then placing the grabbed data where you want it.
Sub Test()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "www.truecaller.com/au/439965324"
'Wait until IE is done loading page
Do While ie.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Loading page ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
MsgBox html.DocumentElement.innerHTML
Dim element As IHTMLElement
Set element = html.getElementsByClassName("result__details")(0)
Dim Name As String
Dim Number As String
Dim Address As String
Name = element.Children(0).Children(1).innerText
Number = element.Children(1).Children(1).innerText
Address = element.Children(2).Children(1).innerText
MsgBox ("Name is " & Name & " with number " & Number & ". Address: " & Address)
'close down IE (Quit ends the hidden process) and reset status bar
ie.Quit
Set ie = Nothing
Application.StatusBar = ""
End Sub
If you want to learn more about scraping with VBA then here's a good link:
http://www.wiseowl.co.uk/blog/s393/scraping-websites-vba.htm

Reading, Writing and controlling Autocad using external VBA

I'm using MS-Access 2010 and Autocad 2012 64bit and work in manufacturing.
I want to be able to at the very least, populate fields in a title block, better still I would like to use data in my access database to write data into a sheet set (the current system works by reading the sheet set values such as sheet title and number into my title block).
The following code is all I have at the moment and it will open autocad and write the date into the command line.
Private Sub OpenAutocad_Click()
Dim CadApp As AcadApplication
Dim CadDoc As AutoCAD.AcadDocument
On Error Resume Next
Set CadApp = GetObject(, "AutoCAD.Application")
If Err.Number <> 0 Then
Set CadApp = CreateObject("AutoCAD.Application")
End If
On Error GoTo 0
CadApp.Visible = True
CadApp.WindowState = acMax
Set CadDoc = CadApp.ActiveDocument
CadDoc.Utility.Prompt "Hello from Access, the time is: " & TheTime
Set CadApp = Nothing
End Sub
I have no idea where to go from here. What are the commands to control the sheet set manager and change data, and can the .dst file be edited without even opening up autocad? is there a list of all available autocad vba commands and functions?
If you are declaring CadApp as AcadApplication you must have added a reference to AutoCAD.
That means you should be able to see the object model using your Object Browser in your VBA IDE. No?
There is also a very helpful site www.theswamp.org which has a whole section devoted to AutoCAD VBA.
If I understand your question correctly, you want to automate filling attributes in a drawing title blocks (such as title, drawer, part number, etc) right from MS Access.
Your code can already access the AutoCAD command line, but AutoCAD doesn't seem to have a command for filling drawing attributes directly. (command list)
So it looks like you need to fill the attributes programmatically using the COM API.
The following question appears to be relevant to yours, and the accepted answer provides sample code:
Is it possible to edit block attributes in AutoCAD using Autodesk.AutoCAD.Interop?
Note that in that question the asker was developing a standalone application in C# .NET, whereas you will be using VB Automation from MS Access. It shouldn't be too different, since the Component Object Model (COM) being used is the same.
What are the commands to control the sheet set manager and change data and can the .dst file be edited without even opening up autocad?
(sorry can't post more than 2 links)
docs.autodesk.com/ACD/2010/ENU/AutoCAD%202010%20User%20Documentation/files/WS1a9193826455f5ffa23ce210c4a30acaf-7470.htm
No mention about data change, though.
is there a list of all available autocad vba commands and functions?
Yes.
%ProgramFiles%\Common Files\Autodesk Shared\acad_aag.chm - Developer's Guide
%ProgramFiles%\Common Files\Autodesk Shared\acadauto.chm - Reference Guide
Online version:
help.autodesk.com/cloudhelp/2015/ENU/AutoCAD-ActiveX/files/GUID-36BF58F3-537D-4B59-BEFE-2D0FEF5A4443.htm
help.autodesk.com/cloudhelp/2015/ENU/AutoCAD-ActiveX/files/GUID-5D302758-ED3F-4062-A254-FB57BAB01C44.htm
More references here:
http://usa.autodesk.com/adsk/servlet/index?id=1911627&siteID=123112
:) Half the way gone ;)
If you have AutoCAD open with a drawing loaded, you can access the whole thing directly.
Sub block_set_attribute(blo As AcadBlockReference, tagname, tagvalue)
Dim ATTLIST As Variant
If blo Is Nothing Then Exit Sub
If blo.hasattributes Then
tagname = Trim(UCase(tagname))
ATTLIST = blo.GetAttributes
For i = LBound(ATTLIST) To UBound(ATTLIST)
If UCase(ATTLIST(i).TAGSTRING) = tagname Or UCase(Trim(ATTLIST(i).TAGSTRING)) = tagname & "_001" Then
'On Error Resume Next
ATTLIST(i).textString = "" & tagvalue
Exit Sub
End If
Next
End If
End Sub
Sub findtitleblock(TITLEBLOCKNAME As String, attributename As String, _
attributevalue As String)
Dim entity As AcadEntity
Dim block As AcadBlock
Dim blockref As AcadBlockReference
' Walk every block definition (model space and layouts included)
For Each block In ThisDrawing.Blocks
For Each entity In block
' Only block references can carry attribute values
If InStr(LCase(entity.ObjectName), "blockref") > 0 Then
Set blockref = entity
If blockref.EffectiveName = TITLEBLOCKNAME Then
Call block_set_attribute(blockref, attributename, attributevalue)
Exit For
End If
End If
Next
Next
End Sub
call findtitleblock("HEADER","TITLE","Bridge column AXIS A_A")
So, assuming you have a title block with the attribute TITLE, this sets that attribute to the drawing name passed in. You might also have to replace ThisDrawing with your CadDoc. I usually control Access and Excel from AutoCAD and not vice versa ;)
Also consider using "REGEN" and "ATTSYNC" if "nothing happens":
ThisDrawing.SendCommand "_attsync" & vbLf

VBA list of filepaths of linked objects in document

I have a number of large Microsoft Word documents with many linked files from many Microsoft Excel spreadsheets. When opening a Word document, even with the 'update linked files at open' option unchecked:
Word still checks each link at its source by opening and closing the relevant excel spreadsheet for each individual link (so for x number of links, even if from the same spreadsheet, Word will open and close the spreadsheet x times). This means opening documents takes a very long time.
I have found that documents open faster if the spreadsheets containing the source of linked objects are already open, so Word doesn't keep opening, closing, reopening them.
So far, the beginnings of a solution I have is to create a list of all the filepaths of the linked objects, done by following VBA code:
Sub TypeArray()
Dim List(), Path As String
Dim i, x As Integer
Dim s As InlineShape
Dim fso As FileSystemObject, ts As TextStream
Set fso = New FileSystemObject
Set ts = fso.OpenTextFile("C:\MyFolder\List.txt", 8, True)
With ts
.WriteLine (ActiveDocument.InlineShapes.Count)
End With
For Each s In ActiveDocument.InlineShapes
Path = s.LinkFormat.SourcePath & "\" _
& s.LinkFormat.SourceName
With ts
.WriteLine (Path)
End With
Next s
End Sub
'--------------------------------------------------------------------------------------
Private Sub WriteStringToFile(pFileName As String, pString As String)
Dim intFileNum As Integer
intFileNum = FreeFile
Open pFileName For Append As intFileNum
Print #intFileNum, pString
Close intFileNum
End Sub
'--------------------------------------------------------------------------------------
Private Sub SendFileToNotePad(pFileName As String)
Dim lngReturn As Long
lngReturn = Shell("NOTEPAD.EXE " & pFileName, vbNormalFocus)
End Sub
which works well, but can only be used after a document is already open, which defeats its purpose.
So, finally, my question(s) are these:
1) Is there a way to run this code (or any better, more efficient code - suggestions are welcome) before opening a Word document and waiting through the long process of checking each link at its source?
2) Is there a way to avoid all this and simply have Word not check the links when I open a document?
Sorry for the long question, and thank you for the help!
If I am not wrong, there should be a Document_Open event according to MSDN. This should effectively act as a before-open event and should fire before links are updated (at least in Excel it fires before calculation).
Try opening the files on document open. Then you will face another problem, namely when to close the files, but that is a much easier thing to do (probably the Document_Close event...).
EDITED:
As comments state, this is too late. You can create a word opener (as a single app or as an addin). The logic basically is:
'1) on something_open run GetOpenFileName dialog
'2) before opening the real thing, open all files accompanied
'3) open the document itself
'4) close all files
'5) close the opener itself
This is not the most trivial way, but I use this logic, for example, to make sure that my applications always run in a fresh copy of Excel etc. But I understand that this is a workaround rather than a solution.
If you are still looking for something on this front, I created the following in a combination of VBA and VB.NET (in VS 2010) to show what can be done quite easily using that system. If VB.NET is no use to you, sorry, but there are reasons why I don't really want to spend time on the pure VBA approach.
At present, it is a "console" application which means you'll probably see a box flash up when it runs, but also means that you are more likely to be able to create this app without VS if you absolutely had to (AFAICR the VB.NET /compiler/ is actually free). It just fetches the link info. (i.e. there's currently no facility to modify links).
The overview is that you have a small piece of VBA (say, in your Normal template) and you need an open document. The VBA starts a Windows Shell, runs the VB.NET program and passes it the full path name of the document you want to open.
The VB.NET program opens the .docx (or whatever) and looks at all the Relationships of type "oleObject" that are referenced from the Main document part (so right now, the code ignores headers, footers, footnotes, endnotes and anywhere else you might have a link)
The VB.NET program automates Word (which we know is running) and writes each link URL into a sequence of Document Variables in the active document. These variables are called "Link1", "Link2", etc. If there are no links (I haven't actually tested that path properly) or the program can't find the file, "Link0" should be set to "0". Otherwise it should be set to the link count.
The shell executes synchronously, so your VBA resumes when it's done. Then you either have 0 links, or a set of links that you can process.
The VBA is like this:
Sub getLinkInfo()
' the full path name of the program, quoted if there are any spaces in it
' You would need to modify this
Const theProgram As String = """C:\VBNET\getmaindocumentolelinks.exe"""
' You will need a VBA reference to the "Windows Script Host Object Model"
Dim oShell As WshShell
Set oShell = CreateObject("WScript.Shell")
' plug your document name in here (again, notice the double quotes)
If oShell.Run(theProgram & " ""c:\a\testdocexplorer.docx""", , True) = 0 Then
With ActiveDocument.Variables
For i = 1 To CInt(.Item("Link0").Value)
Debug.Print .Item("Link" & CStr(i))
Next
End With
Else
MsgBox "Attempt to retrieve links failed"
End If
End Sub
For the VB.NET, you would need the Office Open XML SDK (I think it's version 2.5). You need to make references to that, and Microsoft.Office.Interop.Word.
The code is as follows:
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq
Imports DocumentFormat.OpenXml.Packaging
Imports Word = Microsoft.Office.Interop.Word
Module Module1
Const OLEOBJECT As String = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject"
Sub Main()
Dim s() As String = System.Environment.GetCommandLineArgs()
If UBound(s) > 0 Then
Dim wordApp As Word.Application
Try
wordApp = GetObject(, "Word.Application")
Dim targetDoc As Word.Document = wordApp.ActiveDocument
Try
Dim OOXMLDoc As WordprocessingDocument = WordprocessingDocument.Open(path:=s(1), isEditable:=False)
Dim linkUris As IEnumerable(Of System.Uri) = From rel In OOXMLDoc.MainDocumentPart.ExternalRelationships _
Where rel.RelationshipType = OLEOBJECT _
Select rel.Uri
For link As Integer = 0 To linkUris.Count - 1
targetDoc.Variables("Link" & CStr(link + 1)).Value = linkUris(link).ToString
Next
targetDoc.Variables("Link0").Value = CStr(linkUris.Count)
OOXMLDoc.Close()
Catch ex As Exception
targetDoc.Variables("Link0").Value = "0"
End Try
Finally
wordApp = Nothing
End Try
End If
End Sub
End Module
I originally wrote the .NET code as a COM object, which would be slightly easier to use from VBA, but significantly harder to set up on the .NET side and (frankly) much harder to modify & debug as you have constantly to close Word to release the references to the COM DLLs.
If you actually wanted to fix up the LINK paths, as far as I can tell, modifying them in the relationship records is enough to get Word to update the relevant LINK fields when it opens Word, which saves having to modify the XML code for the LINK fields as well. But that's another story...
I just found out that you can set/modify a DelayOleSrvParseDisplayName registry entry and a NoActivateOleLinkObjAtOpen registry entry to modify the global behaviour:
See http://support.microsoft.com/kb/970154
I also found that activedocument.fields can contain links to external objects (in my case, an Excel sheet).
Use this code to parse them:
for each f in activedocument.fields
debug.print f.code
next
And use activedocument.fields(FIELDNUMBER) to select each object, to figure out where it is in the document.
Maybe also activedocument.Variables and activedocument.Hyperlinks can contain links to external objects? (not in my case).
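To narrow that loop down to actual links, something like the following sketch may help. It assumes the linked objects are LINK fields in the main body of the document; the procedure name is made up for illustration.

```vba
' Sketch only: prints the index and source file of every LINK field
' in the main text of the active document.
Sub ListLinkFieldSources()
    Dim f As Field
    For Each f In ActiveDocument.Fields
        If f.Type = wdFieldLink Then
            Debug.Print f.Index, f.LinkFormat.SourceFullName
        End If
    Next f
End Sub
```

ActiveDocument.Fields only covers the main text story, so links in headers, footers or text boxes would need a separate pass over those ranges.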

One Central Header/Footer used by Multiple Docs (Word 2003 or 2007)

Inside Word (2003 or 2007), is there a way to have one Header/Footer that is used by Multiple documents?
I want to be able to change the header/footer in one spot and have it affect multiple documents.
i.e. I have 50 documents and they all have the same header/footer. Instead of opening all 50 documents to make the change, is there a way to link (OLE?) the 50 documents to a main document and only have to change the main document?
If there is not a built in way, has anyone done this using VBA?
I'm not sure how well this will work in practice, but you can insert other files into a Word document as a link.
First create the document with the header/footer content, with the content in the body of the document. Save it.
Then go to one of your 50 documents, go into the header/footer. Go to INSERT | FILE. Locate the first file, then click the little drop-down arrow next to the OPEN button in the Insert File dialog. From the drop-down, select INSERT AS LINK. The content should now show up in the document. If you click in the content, normally it will have a grey background, to indicate it's really a Word field.
Now when you change the first document, you can open the second document, update the field (click anywhere in it and hit F9) and the new content will be pulled in. You can also update fields programmatically pretty easily, or under TOOLS | OPTIONS | PRINT, there's a box to auto-update the fields every time the document is printed.
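Updating the fields programmatically could be sketched like this. It assumes the linked content lives in the headers and footers of the active document; the procedure name is made up for illustration.

```vba
' Sketch only: refreshes every field in every header and footer
' of the active document.
Sub UpdateHeaderFooterFields()
    Dim sec As Section
    Dim hf As HeaderFooter

    For Each sec In ActiveDocument.Sections
        For Each hf In sec.Headers
            hf.Range.Fields.Update
        Next hf
        For Each hf In sec.Footers
            hf.Range.Fields.Update
        Next hf
    Next sec
End Sub
```

Run over a folder of documents (open, update, save, close), this would give the "change it in one spot" behaviour without pressing F9 in each file.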
AFAIK, altering a document's header (simply) requires having the document open. That said, you have a few options. First, if the documents are saved in the Office XML format, you could open the files using the MSXML library and alter the data in the header. (Or use any of the dozens of other ways to alter what is essentially a text file.) If the files are still in the binary format, you really only have two options. The first is to open the file via VBA and alter the header via the document object model. The second would be to figure out the binary format (which is documented) and alter it using the VB6/VBA native binary IO (very non-trivial).
Unless I thought I could gain more time than I was going to lose writing code to alter the documents directly, I would probably just loop through all the files in the folder, open them and alter them. As for storing the header somewhere... You could just put the header data in a text file and pull it in. Or keep a document template somewhere.
Here is a very trivial example:
Public Sub Example()
Dim asFiles() As String
Dim lFile As Long
Dim docCrnt As Word.Document
asFiles = GetFiles("C:\Test\", "*.doc")
For lFile = 0& To UBound(asFiles)
Set docCrnt = Word.Documents.Open(asFiles(lFile))
docCrnt.Windows(1).View.SeekView = wdSeekCurrentPageHeader
Selection.Text = "I am the header."
docCrnt.Close True
Next
End Sub
Public Function GetFiles( _
ByVal folderPath As String, _
Optional ByVal pattern As String = vbNullString _
) As String()
Dim sFile As String
Dim sFolder As String
Dim asRtnVal() As String
Dim lIndx As Long
If Right$(folderPath, 1&) = "\" Then
sFolder = folderPath
Else
sFolder = folderPath & "\"
End If
sFile = Dir(sFolder & pattern)
Do While LenB(sFile)
ReDim Preserve asRtnVal(lIndx) As String
asRtnVal(lIndx) = sFolder & sFile
lIndx = lIndx + 1&
sFile = Dir
Loop
If lIndx = 0& Then
ReDim asRtnVal(-1& To -1&) As String
End If
GetFiles = asRtnVal
Erase asRtnVal
End Function