Website data table scraper - vba

Before I ask my question, I'm an amateur coder with basically no meaningful experience beyond VBA in ms office apps (I know - noob!)
I'm trying to create a web scraper using VBA to import data into Excel. As per the comments in the code extract below, the best approach I've been able to find was in the winning answer to this question.
Below, I'm using investing.com as an example, but in reality my project will run across multiple sites and feed into matrices that will be updating daily and self-cannibalizing as events expire. For this reason I'd rather front-load the work on the code side to keep my ongoing inputs as minimal as possible.
With that in mind, can I ask if there's a way to do any of the following (brace yourself, this will be cringe-worthy basic knowledge for some):
Is there a way I can navigate to a URL and run a For Each loop on every table on that page (without having a known id for any of them)? This is as much to speed up my code as to minimise my inputs, as there'll be quite a bit of data to update and I was planning on putting a 2-minute looping trigger on the refresh.
Instead of doing what I've been doing below, is it possible to reference a table, rather than a row, and do something along the lines of Cells(2,5).value to return the value within row 1, column 4 (assuming the array indexing starts at 0 in both dimensions)? Further to that, my first column (my primary key in some ways) may not be in the same order on all sources, so is there a way I could do the equivalent of Columns("A:A").Find(What:=[Primary key], After:=Cells(1, 1), LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, MatchCase:=True, SearchFormat:=False).Row to find which row within the table relates to the event I'm looking for?
Code :
Sub Scraper()
Dim appIE, allRowOfData As Object
' As per https://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
.Visible = False
End With
Do While appIE.Busy
Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
Loop
Set allRowOfData = appIE.document.getElementById("pair_8907")
'tr id="[ID of row within table]"
Dim myValue As String: myValue = allRowOfData.Cells(8).innerHTML
'The 8 is the column number of the table
'(note: column numbers start at 0, so the 9th column should have "8" entered here)
Set appIE = Nothing
Range("A1").Value = myValue
End Sub

If you want to use Excel functions to navigate the tables, why not dump the tables onto a worksheet first? This code works for me:
Option Explicit
Sub Scraper()
Dim appIE As Object
' As per http://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
.Visible = True
End With
Do While appIE.Busy
DoEvents
Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
Loop
'Debug.Print TypeName(appIE.document)
Dim doc As Object 'MSHTML.HTMLDocument
Set doc = appIE.document
'* appIE busy is good but you need to wait for the whole document to completely load and initialise so use this
While doc.readyState <> "complete"
DoEvents
Wend
'* we can select all the tables because they share the same CSS class name
Dim tablesSelectedByClass As Object 'MSHTML.HTMLElementCollection
Set tablesSelectedByClass = doc.getElementsByClassName("genTbl")
'* you can change this, it was just convenient for me to add sheets to my workbook
Dim shNewResults As Excel.Worksheet
Set shNewResults = ThisWorkbook.Worksheets.Add
Dim lRowCursor As Long '* this controls pasting down the sheet
lRowCursor = 1
Dim lTableIndexLoop As Long
For lTableIndexLoop = 0 To tablesSelectedByClass.Length - 1
Dim tableLoop As Object 'MSHTML.HTMLTable
Set tableLoop = tablesSelectedByClass.Item(lTableIndexLoop)
If LenB(tableLoop.ID) > 0 Then '* there are some extra nonsense tables, this subselects
Dim sParentColumn As String, objParentColumn As Object ' MSHTML.HTMLSemanticElement
Set objParentColumn = FindMyColumn(tableLoop, sParentColumn) '* need to understand is table on left hand or right hand side
Dim vHeader As Variant: vHeader = Empty
If sParentColumn = "leftColumn" Then
'* tables on the left have a preceding H3 element with the table's description
Dim objH3Headers As Object
Set objH3Headers = objParentColumn.getElementsByTagName("H3")
vHeader = objH3Headers.Item(lTableIndexLoop).innerText
Else
'* tables on the right have a hidden attribute we can use
vHeader = tableLoop.Attributes.Item("data-gae").Value
If Len(vHeader) > 3 Then
vHeader = Mid$(vHeader, 4)
Mid$(vHeader, 1, 1) = Chr(Asc(Mid$(vHeader, 1, 1)) - 32)
End If
End If
'* tables on the right do not have column headers
Dim bHasColumnHeaders As Boolean
bHasColumnHeaders = (tableLoop.ChildNodes.Length = 2)
Dim vTableCells() As Variant '* this will be our table data container which we will paste in one go
Dim lRowCount As Long: lRowCount = 0
Dim lColumnCount As Long: lColumnCount = 0
Dim lDataHeadersSectionIdx As Long: lDataHeadersSectionIdx = 0
Dim objColumnHeaders As Object: Set objColumnHeaders = Nothing
If bHasColumnHeaders Then
Set objColumnHeaders = tableLoop.ChildNodes.Item(0).ChildNodes.Item(0)
lRowCount = lRowCount + 1
lDataHeadersSectionIdx = 1
Else
lDataHeadersSectionIdx = 0
End If
Dim objDataRows As Object 'MSHTML.HTMLElementCollection
Set objDataRows = tableLoop.ChildNodes.Item(lDataHeadersSectionIdx).ChildNodes
lColumnCount = objDataRows.Item(0).ChildNodes.Length
lRowCount = lRowCount + objDataRows.Length
ReDim vTableCells(1 To lRowCount, 1 To lColumnCount) As Variant
'* if we have them, get the column headers
Dim lColLoop As Long
If bHasColumnHeaders Then
For lColLoop = 1 To lColumnCount
vTableCells(1, lColLoop) = objColumnHeaders.ChildNodes.Item(lColLoop - 1).innerText
Next
End If
'* get the data cells
Dim lRowLoop As Long
For lRowLoop = 1 To lRowCount - VBA.IIf(bHasColumnHeaders, 1, 0)
For lColLoop = 1 To lColumnCount
vTableCells(lRowLoop + VBA.IIf(bHasColumnHeaders, 1, 0), lColLoop) = objDataRows.Item(lRowLoop - 1).ChildNodes.Item(lColLoop - 1).innerText
Next
Next
'* paste our table description
shNewResults.Cells(lRowCursor, 1).Value2 = vHeader
lRowCursor = lRowCursor + 1
'* paste our table data
shNewResults.Cells(lRowCursor, 1).Resize(lRowCount, lColumnCount).Value2 = vTableCells
lRowCursor = lRowCursor + lRowCount + 1
End If
Next
End Sub
Function FindMyColumn(ByVal node As Object, ByRef psColumn As String) As Object
'* this code ascends the DOM looking for "column" in the id of each node
While InStr(1, node.ID, "column", vbTextCompare) = 0 And Not node.ParentNode Is Nothing
DoEvents
Set node = node.ParentNode
Wend
If InStr(1, node.ID, "column", vbTextCompare) > 0 Then
Set FindMyColumn = node
psColumn = CStr(node.ID)
End If
End Function
By the way, if you trade a lot, the brokers get rich and you get poor; brokerage charges really add up in the long run.
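For the parts of the question about looping over every table without knowing any ids, and about finding a row by its primary key, here is a minimal sketch rather than a tested solution. It assumes doc is the loaded document object, as in the code above. TableToArray, FindKeyRow and DumpAllTables are hypothetical helper names introduced only for illustration.

```vba
Option Explicit

' Hypothetical helper: copies one HTML table element into a 1-based 2D Variant array.
' Returns Empty if the table has no rows or no cells.
Function TableToArray(ByVal tbl As Object) As Variant
    Dim rowCount As Long, colCount As Long
    Dim r As Long, c As Long
    rowCount = tbl.Rows.Length
    If rowCount = 0 Then Exit Function
    ' Rows can differ in width, so size the array to the widest row
    For r = 0 To rowCount - 1
        If tbl.Rows(r).Cells.Length > colCount Then colCount = tbl.Rows(r).Cells.Length
    Next r
    If colCount = 0 Then Exit Function
    Dim result() As Variant
    ReDim result(1 To rowCount, 1 To colCount)
    For r = 0 To rowCount - 1
        For c = 0 To tbl.Rows(r).Cells.Length - 1
            result(r + 1, c + 1) = tbl.Rows(r).Cells(c).innerText
        Next c
    Next r
    TableToArray = result
End Function

' Hypothetical helper: returns the array row whose first column equals keyText
' (case-sensitive, whole value), or 0 if no row matches.
Function FindKeyRow(ByVal data As Variant, ByVal keyText As String) As Long
    Dim r As Long
    For r = LBound(data, 1) To UBound(data, 1)
        If StrComp(Trim$(CStr(data(r, 1))), keyText, vbBinaryCompare) = 0 Then
            FindKeyRow = r
            Exit Function
        End If
    Next r
End Function

' Visits every <table> on the page by tag name, so no ids are needed.
Sub DumpAllTables(ByVal doc As Object)
    Dim tbl As Object, data As Variant
    For Each tbl In doc.getElementsByTagName("table")
        data = TableToArray(tbl)
        If IsArray(data) Then
            Debug.Print UBound(data, 1) & " rows x " & UBound(data, 2) & " columns"
        End If
    Next tbl
End Sub
```

Once a table is in an array, data(2, 5) plays the role of Cells(2, 5), and FindKeyRow stands in for Columns("A:A").Find when the first column is in a different order on each source. Note that the array is 1-based here, unlike the 0-based DOM collections.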

Related

CopyPicture Range Name after other cell

I'm trying to CopyPicture cells in column B and name each picture after the value in column A. I have code that works, except it keeps giving the pictures the wrong names. The baffling thing is that sometimes it works perfectly, and other times it does not.
I have tried to cobble together a routine based on posted examples of the CopyPicture command. I'm pasting it in below.
Yes, I'm a newbie at VBScript. Be gentle. ;-)
Sub makepic()
Dim path As String
path = "C:\BP\BP2020\JPGs\"
Dim CLen As Integer
Dim cntr As Integer
cntr = 1
Dim rgExp As Range
Dim CCntr As String
CString2 = "A1:A6"
Set rgExp2 = Range(CString2)
CString = "B1:B6"
Set rgExp = Range(CString)
For I = 1 To rgExp.Cells.Count Step 1
CCntr = rgExp2.Cells(I).Value
rgExp.Cells.Cells(I).Font.Size = 72
rgExp.Cells.Cells(I).CopyPicture Appearance:=xlScreen, Format:=xlBitmap
rgExp.Cells.Cells(I).Font.Size = 14
''' Create an empty chart with exact size of range copied
CLen = Len(rgExp.Cells.Cells(I).Value)
CWidth = CLen * 85
With ActiveSheet.ChartObjects.Add(Left:=1600, Top:=rgExp.Top, _
Width:=CWidth, Height:=50)
.Name = "ChartVolumeMetricsDevEXPORT"
.Activate
End With
''' Paste into chart area, export to file, delete chart.
If CCntr <> "" Then
ActiveChart.Paste
Selection.Name = "pastedPic"
ActiveSheet.ChartObjects("ChartVolumeMetricsDevEXPORT").Chart.Export (path + CCntr & ".jpg")
ActiveSheet.ChartObjects("ChartVolumeMetricsDevEXPORT").Delete
End If
cntr = cntr + 1
Next
End Sub
Again, I expect -- for example -- a picture of the contents of cell B1 to have the name of the contents of A1. I tried making the range A1:B4 (for example), but that got me 8 pictures. I finally decided to try to make 2 ranges, but that didn't work either.

Pull data from Website into VBA

This might fall under the heading of dumb questions from a newbie, but I honestly don't know where to start in VBA. I tried a few different approaches from the web to pull data from the site below, and all of them failed miserably. Can someone help me (more or less show me) how to pull the data from this website?
https://rotogrinders.com/projected-stats/nfl?site=fanduel
It wouldn't even let me use Data -> Import. Here is what I have so far. I keep getting stuck on the line For t = 0 To (Table.Length - 1).
Sub test1()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
Dim Table As Object
Dim t As Integer
Dim r As Integer
Dim c As Integer
With appIE
.Navigate "https://rotogrinders.com/projected-stats/nfl?site=fanduel"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set Table = appIE.document.getElementById("proj-stats")
For t = 0 To (Table.Length - 1)
For r = 0 To (Table(t).Rows.Length - 1)
For c = 0 To (Table(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = Table(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
appIE.Quit
Set appIE = Nothing
End Sub
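You are close, and there are several ways to get the data. Note that getElementById returns a single element (or Nothing) rather than a collection, so there is no Length to loop over, which is likely why that line fails. I chose to extract all cell elements (HTML <td>) and step through them in a simple loop. Since there are six columns, I'm using two variables (r and c, for row and column) to offset each value into the right cell.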
Set Table = appIE.document.getElementsbytagname("td")
r = 0
c = 0
For Each itm In Table
Worksheets(1).Range("A1").Offset(r, c).Value = itm.innertext
c = c + 1
If c Mod 6 = 0 Then
r = r + 1
c = 0
End If
Next itm
One last note: sometimes the browser hadn't finished loading before the script moved on. I cheated by setting a breakpoint before the loop, waiting until the page had loaded, then hitting F5 to continue execution, which ensured it would always run.
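The breakpoint trick can usually be replaced by waiting on the browser's own state. This is a minimal sketch, not a guaranteed fix: WaitForPage is a hypothetical helper name, the 30-second default timeout is an arbitrary assumption, and a page that fills its table with JavaScript after loading may still need an extra wait.

```vba
' Hypothetical helper: waits until Internet Explorer reports the page as loaded,
' or gives up once timeoutSeconds have passed. Returns True if the page loaded.
Function WaitForPage(ByVal ie As Object, Optional ByVal timeoutSeconds As Double = 30) As Boolean
    Const READYSTATE_COMPLETE As Long = 4
    Dim startTime As Double
    startTime = Timer
    Do While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE
        DoEvents   ' keep Excel responsive while waiting
        ' Timer resets at midnight; for simplicity a reset is treated as a timeout
        If Timer < startTime Or Timer - startTime > timeoutSeconds Then Exit Function
    Loop
    WaitForPage = True
End Function
```

It would be called in place of the Do While appIE.Busy loop, for example: If WaitForPage(appIE) Then Set Table = appIE.document.getElementsByTagName("td").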

Trying to write an Excel macro to find a large text string and copy to another sheet

I have a spreadsheet with a column of cells that each contain several paragraphs of text. I'm trying to write a macro that will grab the sentences between the text phrases "How we made our decision" and "Conclusion".
The location of this text string varies from sheet to sheet but the column is always consistent.
I've been able to find a bunch of VBA scripts that let me find and copy one word at a time, or simple batches of single words. I just can't figure out or find something that lets me copy an entire paragraph from within a single cell of paragraphs.
The code below just grabs the entire table. As you can see in the beginning portion I was able to get what I need however I found out that the (70) is irrelevant because the table size changes with each pull of the record.
Sub GetTheData()
Dim T As String
Dim SWs As New SHDocVw.ShellWindows
Dim IE As SHDocVw.InternetExplorer
Dim LetPr As InternetExplorer
Dim Doc As HTMLDocument
'Dim IE As Object
Dim tbls, tbl, trs, tr, tds, td, r, c
For Each IE In SWs
If IE.LocationName = "Letter Preparation Case Summary – Member Case" Then
Set LetPr = IE
'LetPr.document.getElementById
T = IE.document.getElementsByTagName("td")(70).innerText
'T = Trim(Mid(T, InStr(T, "How We Made Our Decision: ") + 0, InStr(T, "Conclusion") - (InStr(T, "How We Made Our Decision:") + 26)))
Exit For
End If
Next
Set tbls = IE.document.getElementsByTagName("table")
For r = 0 To tbls.Length - 1
Debug.Print r, tbls(r).Rows.Length
Next r
Set tbl = IE.document.getElementsByTagName("table")(9)
Set trs = tbl.getElementsByTagName("tr")
For r = 0 To trs.Length - 1
Set tds = trs(r).getElementsByTagName("td")
'if no <td> then look for <th>
If tds.Length = 0 Then Set tds = trs(r).getElementsByTagName("th")
For c = 0 To tds.Length - 1
ActiveSheet.Range("A1").Offset(r, c).Value = tds(c).innerText
Next c
Next r
End Sub
You stated that you wanted the text 'between these text phrases', so the beginning position of the found text has to be adjusted by the length of the searched string.
Dim beginStr As String, endStr As String, beginPos As Long, endPos As Long
beginStr = "How We Made Our Decision:"
endStr = "Conclusion"
beginPos = InStr(1, T, beginStr, vbTextCompare)
'InStr raises an error when its start argument is 0, so only look for the end marker once the begin marker has been found
If beginPos > 0 Then
beginPos = beginPos + Len(beginStr)
endPos = InStr(beginPos, T, endStr, vbTextCompare)
If endPos > 0 Then T = Trim(Mid(T, beginPos, endPos - beginPos))
End If
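Because beginPos has already been moved past the search phrase, Mid(T, beginPos, endPos - beginPos) stops just before "Conclusion", so that last endPos - beginPos should not need any further adjustment; Trim removes any surrounding spaces.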
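Since the original goal was to do this for a whole column of cells, the same logic can be wrapped in a function and applied cell by cell. This is a sketch under assumptions: TextBetween and CopyDecisionText are hypothetical names, and the sheet names "Source" and "Output" and the use of column A are placeholders for your own layout.

```vba
Option Explicit

' Hypothetical helper: returns the text between beginStr and endStr (case-insensitive),
' or an empty string if either marker is missing.
Function TextBetween(ByVal source As String, ByVal beginStr As String, ByVal endStr As String) As String
    Dim beginPos As Long, endPos As Long
    beginPos = InStr(1, source, beginStr, vbTextCompare)
    If beginPos = 0 Then Exit Function
    beginPos = beginPos + Len(beginStr)          ' skip past the begin marker itself
    endPos = InStr(beginPos, source, endStr, vbTextCompare)
    If endPos = 0 Then Exit Function
    TextBetween = Trim$(Mid$(source, beginPos, endPos - beginPos))
End Function

' Walks column A of an assumed "Source" sheet and writes each extract to column A of "Output".
Sub CopyDecisionText()
    Dim src As Worksheet, dst As Worksheet
    Dim lastRow As Long, r As Long
    Set src = ThisWorkbook.Worksheets("Source")   ' assumed sheet name
    Set dst = ThisWorkbook.Worksheets("Output")   ' assumed sheet name
    lastRow = src.Cells(src.Rows.Count, 1).End(xlUp).Row
    For r = 1 To lastRow
        dst.Cells(r, 1).Value = TextBetween(CStr(src.Cells(r, 1).Value), _
                                            "How We Made Our Decision:", "Conclusion")
    Next r
End Sub
```

For example, TextBetween("Intro. How We Made Our Decision: We reviewed the file. Conclusion: approved.", "How We Made Our Decision:", "Conclusion") returns "We reviewed the file.".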

VBA macro to format XML in Excel to CSV

I need to reformat an XML file to .CSV.
I already opened the XML in Excel and did a little formatting, but now I really need to write a macro to get the data into shape. I have already started, but I really have issues with the loop logic.
The list has a couple of thousand articles with a variable number of subarticles.
Each subarticle has the same number of properties, but not every article has the same properties.
https://picload.org/image/ipialic/now.jpg
https://picload.org/image/ipialip/then.jpg
My code so far looks like this:
Option Explicit
Dim rowCount As Long, articleCount As Long, propertyCount As Integer, name As String
Sub Sortfunction()
rowCount = 1
articleCount = 0
propertyCount = 0
Do While Sheets("Test").Cells(rowCount, 1).Value <> "end"
If Cells(rowCount, 1).Value = "Reference" Then
rowCount = rowCount + 1
Do While Cells(rowCount, 3).Value = ""
If Cells(rowCount, 3).Value = "4" Then
End If
articleCount = articleCount + 1
Loop
articleCount = articleCount + 1
End If
rowCount = rowCount + 1
Loop
Sheets("result").Cells(1, 1).Value = rowCount
Sheets("result").Cells(2, 1).Value = articleCount
End Sub
At the end of the document I wrote "end" to have a hook to stop the loop.
Can anyone provide some help? I'm really not the best programmer :-/
I'd really appreciate any help I can get :-)
Here is a translation into an algorithm, with some tips on functions.
Update: it was trickier than I thought, and I had to rewrite the code.
The main problem is how to decide when to change column.
I chose this rule: each product in a reference must have the same number of properties.
If that's not the case, please explain how you decide when a new column has to be created (you can explain it in plain words).
Here is the rewritten code. I tried it on your example and it works.
Public Sub test()
' Set the range to navigate in your first sheet
Dim cell As Range: Set cell = Sheets("Feuil1").Range("A1")
' set the range to navigate in your result sheet
Dim res As Range: Set res = Nothing
' pos will be used to know the position of a product
Dim lastProperties As Range, posProperties As Range
' While the cell value is not "end"
Do While cell <> "end"
' if the cell is a reference
If cell = "Reference" Then
' Set the range of res
If res Is Nothing Then
Set res = Sheets("Feuil2").Range("A1")
Else
Set res = Sheets("Feuil2").Range("A" & lastProperties.offset(2).Row)
End If
' I set Offset(2) so you will have an empty line between 2 references
' Set the text of the new reference in the result
res = cell.offset(, 1) ' The reference is the cell 1 offset the right of the cell "Reference"
' NOTE: the property titles are no longer written here; that is done in the Else branch.
' Here you just write the new reference
Else
' Here we have a property
' If the property already exists, consider it a new product in the reference
' When we are on a new product, its column is the next one to the right
If GetProperties(cell.offset(, 3), res, posProperties) Then
Set lastProperties = posProperties
End If
posProperties = cell.offset(, 4)
End If
' DON'T FORGET: move on to the next cell
Set cell = cell.offset(1)
Loop
End Sub
And the function to search / create your properties
Private Function GetProperties(ByVal propValues As String, ByVal start As Range, ByRef position As Range) As Boolean
Set position = start.offset(1)
' Does the cell below hold this property?
' If so, search for the first empty cell on that row
If position = propValues Then
Do
Set position = position.offset(, 1)
Loop While Trim(position) <> ""
' Indicate it's an existing value
GetProperties = True
Exit Function
End If
' Is the range empty ?
If Trim(position) = "" Then
' Create the new properties
position = propValues
Set position = position.offset(, 1)
GetProperties = False
Exit Function
End If
' Search the properties in the row below
GetProperties = GetProperties(propValues, position, position)
End Function
That should do the job. If you have any questions about any part, don't hesitate to ask.
If you don't know about Offset, here is some reading: https://msdn.microsoft.com/en-us/library/office/ff840060.aspx

Excel VBA - Perform Operations on visible cells only

I have a database with in excess of 200,000 rows. When I wrote the VBA script I had a database of about 20,000 rows in mind, so I didn't care whether the database was filtered or not because the script ran quickly. Now that the database is huge, I was surprised by how slowly the script runs when I test it. Without further ado, this is what my code looks like:
Set wsDB = ThisWorkbook.Sheets("DB")
Dim nameIndex As Long: nameIndex = Application.Match(name, wsDB.Rows(1), 0)
Dim formula As String
formula = "=IFERROR(AVERAGEIFS(" + GRA(nameIndex) + "," + GRA(dateIndex) + ",R2C," + GRA(cellNameIndex) + ",RC1" + "),"""")"
where GRA is a function that returns the address of the range of a column.
Private Function GRA(ByRef rngIndex As Long)
GRA = "DB!" + CStr(Range(Cells(2, rngIndex), Cells(rowNos, rngIndex)).Address(1, 1, xlR1C1, 0, 0))
End Function
Given that I now filter the table beforehand, how can I adjust my code so that it ignores all the hidden rows and takes into account only what is visible? Of course I am aware that a simple, dirty solution would be to copy the filtered database and paste it into a new sheet, but that would hurt the performance, which is what I'm trying to improve.
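You can use the following function to find out which cells are visible. It returns an array the same size as the range, holding 1 for each visible cell and 0 for each cell in a hidden row or column.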
Function VisibleCells(Rng As Range) As Variant
Dim R As Range
Dim Arr() As Integer
Dim RNdx As Long
Dim CNdx As Long
If Rng.Areas.Count > 1 Then
VisibleCells = CVErr(xlErrRef)
Exit Function
End If
ReDim Arr(1 To Rng.Rows.Count, 1 To Rng.Columns.Count)
For RNdx = 1 To Rng.Rows.Count
For CNdx = 1 To Rng.Columns.Count
Set R = Rng(RNdx, CNdx)
If (R.EntireRow.Hidden = True) Or _
(R.EntireColumn.Hidden = True) Then
Arr(RNdx, CNdx) = 0
Else
Arr(RNdx, CNdx) = 1
End If
Next CNdx
Next RNdx
VisibleCells = Arr
End Function
The above code came from http://www.cpearson.com/excel/VisibleCells.aspx.
Normally I would only post code that I wrote myself; however, this does exactly what I had in mind.
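As an alternative to testing every cell's Hidden property, Excel's built-in Range.SpecialCells(xlCellTypeVisible) returns only the visible part of a range. Here is a minimal sketch, assuming it is called from a macro (not from a worksheet formula) on a range with more than one cell. AverageVisible is a hypothetical name and the averaging is just an illustration of operating on visible cells only.

```vba
Option Explicit

' Hypothetical helper: averages the numeric cells of rng that are not hidden by a filter.
' Assumes rng has more than one cell and that this runs from a macro.
Function AverageVisible(ByVal rng As Range) As Variant
    Dim visibleRng As Range, c As Range
    Dim total As Double, n As Long

    On Error Resume Next          ' SpecialCells raises an error when nothing is visible
    Set visibleRng = rng.SpecialCells(xlCellTypeVisible)
    On Error GoTo 0

    If visibleRng Is Nothing Then
        AverageVisible = CVErr(xlErrNA)
        Exit Function
    End If

    ' For Each walks every cell in every visible area of the filtered range
    For Each c In visibleRng.Cells
        If IsNumeric(c.Value2) And Not IsEmpty(c.Value2) Then
            total = total + c.Value2
            n = n + 1
        End If
    Next c

    If n = 0 Then
        AverageVisible = CVErr(xlErrDiv0)
    Else
        AverageVisible = total / n
    End If
End Function
```

A call might look like Debug.Print AverageVisible(wsDB.Range("C2:C200001")), where the address is only a placeholder. Looping cell by cell over 200,000 rows may still be slow, so it is worth timing this against your current formula-based approach.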