How to connect to Apache Spark/Hadoop from VBA - sql

I am familiar with SQL (especially Postgres) and VBA, but on the Apache Spark side I am a total newbie. Still, it seems to run queries and return results much faster than plain SQL?
At the moment, in my daily work, I connect Excel VBA to PostgreSQL through OLE DB (other people use ODBC, etc.), so whenever I need to retrieve something from the DB I can easily do so by opening a connection and writing a SQL string in VBA, then dumping the output into the desired sheets and cells. The drawback is speed: as my data grows larger and larger, complicated SQL queries for complicated calculations or relations take a very long time to return results.
So, other than upgrading the server that hosts the DB, I have heard that Spark/Hadoop is the solution for speeding up such tasks?
Typically when I need to do the VBA-postgres interaction, I do something like the following:
Public Sub refresh_cf()
    Dim dataConn As New ADODB.Connection
    Dim strSQL As String
    Dim strCON As String
    Dim strCmd As New ADODB.Command
    Dim loadTable As QueryTable

    strCON = "Server=server IP;" & _
             "DSN=PostgreSQL35W;" & _
             "UID=USERNAME;" & _
             "PWD=PASSWORD;" & _
             "Database=DBNAME;" & _
             "Port=5432;" & _
             "CommandTimeout=12"
    dataConn.ConnectionString = strCON
    dataConn.Open

    strSQL = "SELECT * FROM TABLE WHERE...."
    strCmd.ActiveConnection = dataConn
    strCmd.CommandType = adCmdText
    strCmd.CommandText = strSQL
    strCmd.CommandTimeout = 0

    Set loadTable = Sheet2.QueryTables.Add(Connection:=strCmd.Execute, _
                                           Destination:=Sheet2.Range("A4"))
    With loadTable
        .BackgroundQuery = False
        .AdjustColumnWidth = False
        .Refresh
    End With
End Sub
I am wondering whether we can achieve the same thing, connecting to Apache Spark or Hadoop and returning query results this way?
Say each of the databases we are processing is fairly huge: billions of rows, with many columns per row. If I perform some complex relational calculations within such a DB, the Postgres instance on our server may currently take hours, if not days, to complete the task, even though the returned result might not be that big (i.e. it does not exceed Excel's per-sheet limit of 1,048,576 rows). Is it worth utilizing Hadoop or Spark via VBA, if that is possible?
I know we can do this in Python for sure, something like:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Row
# or
from pyspark.sql import SQLContext, Row

sc = SparkContext(...)
hiveCtx = HiveContext(sc)
# Then we can run a query:
hiveCtx.sql("""SELECT * FROM TABLE WHERE....""")
Moreover, I found a link that introduces an ODBC connection to Hadoop. If you are doing this, can you share your approach and provide some basic example code to clarify the process? Thanks.

I don't have Hadoop installed on my personal laptop, so I can't test this process, but I think it should go essentially like this.
You can see all the details here:
https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-connect-excel-hive-odbc-driver
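For the VBA side, the flow mirrors the PostgreSQL example above: install a Hive ODBC driver (Spark's Thrift Server speaks the same HiveServer2 protocol, so the same drivers generally work), then open the ADODB connection with a Hive connection string instead of the Postgres one. Here is a rough sketch of the same round trip from Python; the host, port, and driver name are placeholders, and the exact connection-string keywords depend on which ODBC driver you install:

```python
def build_hive_conn_str(host, port=10000, schema="default",
                        driver="Hive ODBC Driver"):
    """Assemble a DSN-less ODBC connection string for HiveServer2.

    Keyword names here follow the Simba-based Hive/Spark ODBC drivers
    (an assumption: check your driver's documentation for the exact keys).
    """
    return f"DRIVER={{{driver}}};HOST={host};PORT={port};Schema={schema}"


def fetch_rows(conn_str, sql):
    """Run a query over ODBC and return all rows."""
    import pyodbc  # third-party: pip install pyodbc
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        return conn.cursor().execute(sql).fetchall()
    finally:
        conn.close()


# Example (requires a live HiveServer2 / Spark Thrift Server):
# rows = fetch_rows(build_hive_conn_str("hadoop-node"),
#                   "SELECT * FROM some_table LIMIT 10")
```

From VBA you would put the same `DRIVER=...;HOST=...;PORT=...` string into `dataConn.ConnectionString` and keep the rest of the `refresh_cf` pattern unchanged.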

Related

Importing a text file into Postgres using ODBC

I am trying to directly import a text file into Postgres, but using ODBC. I realize that this is a bit of an odd thing to do, but ODBC does a good job of fixing/ignoring errors in the text files and Postgres' Copy command is very, very picky. I use Copy when I can and ODBC where I can't.
I am currently doing this in two steps. ODBC Import to Access and then from Access to Postgres but I recently learned over on MSDN I may be able to do this in one step but am having trouble with the SQL.
Here is the code I am using:
Dim TextConnection As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=c:\PathToTextFile;Extended Properties=""Text;HDR=No""")
Dim TextCommand As System.Data.OleDb.OleDbCommand
TextCommand = New System.Data.OleDb.OleDbCommand("SELECT * INTO [ODBC;Driver={PostgreSQL};" & _
" Server=server;Database=database;Uid=UserName;Pwd=Password;].[TableName] FROM [textfile.txt]", TextConnection)
TextCommand.ExecuteNonQuery()
I am getting this error: Query input must contain at least one table or query.
I am not sure where to go from here in debugging this. It also just might not be possible, and that would be good to know.

Connect to SQL via ODBC using a .bat

I am currently performing a SQL query using Oracle SQL Developer and pasting in a standard query (same SELECT every time), then exporting to a csv file. I would like to execute the query via a batch file and place the output in a specified folder. I already use many batch files on this machine and would like to make this query part of a routine.
My machine has an existing ODBC connection to "WHPROD", but I do not know how to use it. Is there a way to connect to WHPROD from a batch file?
This is hardly possible directly in a batch script without getting overcomplicated.
However, it is simple to do with VBScript, and since you can call a VBScript from a batch script, the result will be exactly what you want.
Here's an example of how to connect to Oracle and retrieve the results of a SELECT from VBScript: (source)
Dim strCon
strCon = "Driver={Microsoft ODBC for Oracle}; " & _
         "CONNECTSTRING=(DESCRIPTION=" & _
         "(ADDRESS=(PROTOCOL=TCP)" & _
         "(HOST=Server_Name)(PORT=1521))" & _
         "(CONNECT_DATA=(SERVICE_NAME=DB_Name))); uid=system;pwd=system;"

Dim oCon: Set oCon = WScript.CreateObject("ADODB.Connection")
Dim oRs: Set oRs = WScript.CreateObject("ADODB.Recordset")

oCon.Open strCon
Set oRs = oCon.Execute("SELECT name from v$database")

While Not oRs.EOF
    WScript.Echo oRs.Fields(0).Value
    oRs.MoveNext
Wend

oCon.Close
Set oRs = Nothing
Set oCon = Nothing
And here's how to call your VBS from the batch script :
@echo off
cscript //nologo "C:\yourbatchfolder\yourscript.vbs"
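If Python happens to be installed on the machine, the same pattern (batch file calls a script, the script runs the query over the existing ODBC DSN and writes a CSV) takes only a few lines. In this sketch the DSN name WHPROD comes from the question; the query, output path, and the third-party pyodbc package are assumptions, and only the helper function is generic:

```python
import csv


def export_query_to_csv(conn, sql, csv_path):
    """Run sql on an open DB-API connection and dump the result set to CSV,
    with a header row taken from the cursor's column metadata."""
    cur = conn.cursor()
    cur.execute(sql)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # column names
        writer.writerows(cur.fetchall())


# Run as a script over the machine's existing DSN (requires pyodbc):
# import pyodbc
# with pyodbc.connect("DSN=WHPROD") as conn:
#     export_query_to_csv(conn, "SELECT name FROM v$database",
#                         r"C:\exports\whprod.csv")
```

The batch line then becomes simply `python C:\yourbatchfolder\export_whprod.py`, scheduled the same way as your other batch jobs.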
I wrote an in-depth batch script once that builds an MSSQL database. There are lots of great hints in that script that may help you get going.
The script does not specifically use ODBC, but I believe the SQLCMD arguments can be changed to use an ODBC-defined connection. That may work for you, and not just on MSSQL Server.

Can I create a pass-through query via VBA to call a parameterized tSQL UDF and send it a dynamic parameter to return results to Access?

I currently have a SQL 2008 R2 database backend with an Access 2013 accdb front end with ODBC DSN-less connection and linked tables. In SQL I have many parameterized tSQL UDFs created to feed data into reports (currently working well in my Access 2010 adp frontend). The reports are complicated: multiple tSQL UDFs run calculations and then feed into a final UDF that feeds the respective report. I would like to keep the UDFs on the server – rewriting into Access queries would be a poor solution.
My problem is that I have not been able to figure out how to write the VBA to send a pass-through query that calls the tSQL UDF and gives it a parameter, which changes for each report. I know pass-through queries are read-only; that's fine. I've read that I can call a stored procedure (SP) from VBA, but can I call the UDF directly, rather than converting each UDF to an SP (or creating an SP whose only job is to call the UDF) just so that I can call the SP from VBA? Based on my research, I think I might have to do one of those two conversions to get the VBA to work (i.e., return results without error). Is this correct?
I found this discussion: https://social.msdn.microsoft.com/Forums/office/en-US/898933f5-73f9-44e3-adb9-6aa79ebc948f/calling-a-sql-udf-from-access?forum=accessdev , but it has conflicting statements “You can't call a tSql udf from Access.”, and “You can use a passthrough query to call UDF's or stored procedures or anything else written in tsql.” Also, their code is written in ADO instead of DOA so it’s a bit cryptic to me since I’ve only written DAO so far, but the general gist that I got was they converted their UDF to a SP.
I found this article a great read, but again did not get a clear “yes” to my question:
http://technet.microsoft.com/en-us/library/bb188204(v=sql.90).aspx
It may be possible to remove the parameter from the Server side and add it to the Access side similar to this Access 2007 forms with parameterized RecordSource , but wouldn't that cause Access to load the entire dataset before filtering, instead of processing on the Server side – possibly causing performance issues?
I can successfully create a pass-through query in the Access interface if I supply it with a constant parameter, for example "Select * from udf_FinalReport(2023)", but what I really need is to pass a dynamic parameter, for example one taken from Forms!Control![txtboxValue]. Can I do this? The following code is what I'm using. It works if I use a table name in the SQL (e.g. "SELECT * FROM Table WHERE tblYear = " & intYear) in line 9, so I feel I have everything coded right, but when I put my UDF in the SQL like below I get error #3131, "Syntax error in FROM clause." (I did verify that I should not use the schema prefix (dbo.); that gives error 3024, "could not find file".) Is this user error, or is it just plain telling me I can't call a UDF this way?
1 Sub AnnualSummary()
2 Dim dbs As DAO.Database
3 Dim qdfPoint As DAO.QueryDef
4 Dim rstPoint As DAO.Recordset
5 Dim intYear As Integer
6 intYear = Reports!Annual_Delineation_Summary!txtYear
7 Set dbs = OpenDatabase("", False, False, "ODBC;DRIVER=sql server;SERVER=******;" & _
8 "APP=Microsoft Office 2010;DATABASE=*******;Network=DBMSSOCN")
9 Set qdfPoint = dbs.CreateQueryDef("", "Select * from udf_AnnualReport(" & intYear & ")")
10 GetPointTemp qdfPoint
11 ExitProcedure:
12 On Error Resume Next
13 Set qdfPoint = Nothing
14 Set dbs = Nothing
15 Set rstPoint = Nothing
16 Exit Sub
17 End Sub
18
19 Function GetPointTemp(qdfPoint As QueryDef)
20 Dim rstPoint As Recordset
21 With qdfPoint
22 Debug.Print .Name
23 Debug.Print " " & .SQL
24 Set rstPoint = .OpenRecordset(dbOpenSnapshot)
25 With rstPoint
26 .MoveLast
27 Debug.Print " Number of records = " & _
28 .RecordCount
29 Debug.Print
30 .Close
31 End With
32 End With
33 End Function
I also tried writing the code a little differently, using the following instead of lines 5, 6, and 9. This also works when I use a table name in the select statement, but I get error #3131 when I use a UDF name:
Set qdfPoint = dbs.CreateQueryDef("", "Parameters year int; Select * from Point_Info where " & _
    "year(Sample_Date) = year")
qdfPoint.Parameters("year").Value = intYear
Both code variations also work if I try use the name of a SQL View in the tSQL SELECT statement.
My conclusion is that writing the pass-through query in ADO instead of DAO works well. But I have found that it is still probably better to execute a stored procedure than to try to call the UDF. Here is the code that ended up working most smoothly for me (my ADO connection uses the public variables strUID and strPWD):
Dim cn As ADODB.Connection
Dim rs As ADODB.Recordset
Dim strPoint As String
strPoint = Forms!FRM_Vegetation_Strata!Point_ID
Set cn = New ADODB.Connection
cn.Open "Provider = sqloledb;Data Source=imperialis.inhs.illinois.edu;" & _
"Initial Catalog=WetlandsDB_SQL;User Id=" & strUID & ";Password=" & strPWD
Set rs = New ADODB.Recordset
With rs
Set .ActiveConnection = cn
.Source = "sp_Report_VegWorksheet '" & strPoint & "'"
.LockType = adLockOptimistic
.CursorType = adOpenKeyset
.CursorLocation = adUseClient
.Open
End With
Set Me.Recordset = rs
On a side note, I found that to get Set Me.Recordset to fill a subform, you should put this code in the subform's Open event.
Then to clean up your connection:
Private Sub Form_Unload(Cancel As Integer) 'use "unload", not "close"
'Close the ADO connection we opened
Dim cn As ADODB.Connection
Dim rs As ADODB.Recordset
Set cn = Me.Recordset.ActiveConnection
cn.Close
Set cn = Nothing
Set rs = Nothing
Set Me.Recordset = Nothing
End Sub
This approach does not work for populating a report. "Set Me.Recordset" only works for forms. I believe I will have to call a stored procedure then populate a temp table to use as the report recordset.
EDIT: I have found that I can call a SQL UDF or SP from VBA in Access using DAO. This is particularly helpful when you want to pull the data from a complicated SQL function/procedure into an Access-side temp table. See Juan Soto's blog: https://accessexperts.com/blog/2012/01/10/create-temp-tables-in-access-from-sql-server-tables/#comment-218563 His code puts the results into a temp table, which is what I wanted in order to populate my reports. I used his code example with the following calls:
To execute an SP: CreateLocalSQLTable "testTBL", "exec dbo.sp_Report_WetDet_point '1617-1-1A'", False
To call a UDF: CreateLocalSQLTable "testTBL", "Select * from dbo.QryReport_Main('1617-2-2A')", False
I don't know if it's the most efficient method of passing a variable parameter through a pass-through query into a function and returning the results to Access, as I am still relatively new to Access, but I came across this earlier when I was attempting a similar problem.
I managed it by creating a couple of pass-through queries that executed functions in SQL Server and returned a result. I then wrote a small VBA routine that rewrote the pass-through queries with the new variable each time I wanted to change it, and executed them.
I got the result back out using OpenRecordset, and stored it as a string to use in the rest of my code.
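One detail worth noting: error 3131 is a Jet/ACE parser error, which suggests the SQL is being interpreted locally rather than passed through, and Jet has no notion of a table-valued UDF in a FROM clause. A true pass-through hands the text to SQL Server untouched, and from any direct ODBC client the same SELECT works, with the parameter bound client-side instead of concatenated. As an illustration, here is a sketch from Python; the UDF name is the one from the question, and the connection string is a placeholder:

```python
def udf_select(udf_name, param_count=1):
    """Build a parameterized SELECT against a table-valued UDF,
    with one '?' placeholder per parameter."""
    placeholders = ", ".join("?" * param_count)
    return f"SELECT * FROM dbo.{udf_name}({placeholders})"


def run_udf(conn, udf_name, *params):
    """Execute the UDF over an open DB-API connection (e.g. pyodbc)."""
    cur = conn.cursor()
    cur.execute(udf_select(udf_name, len(params)), params)
    return cur.fetchall()


# e.g. with pyodbc against SQL Server (placeholder connection string):
# conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
#                       "SERVER=...;DATABASE=...;UID=...;PWD=...")
# rows = run_udf(conn, "udf_AnnualReport", 2023)
```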

Access increase speed of DLookup by "caching" tables or other strategy?

Background
I have an Access splitform with multiple DLookups. There are about 10 total DLookups on the form and there are approximately 25-50 records displayed at any one time in the Splitform view.
The Access frontend is linked to SQL tables.
When the DLookup values are displayed in the Datasheet view, it becomes quite slow to view the information, because there are frequent recalculations (each time anything in the dataset changes Access appears to recalculate all DLookups for the entire Splitform datasheet). This was very noticeably and unacceptably slow when connecting through VPN.
Research
I decided to investigate and wrote the following to determine why things were so slow. I also wanted to check if DLookup was slower than a SQL query for some reason.
Sub testLotsofDlookups()
    Dim count As Integer
    Dim startTime As Date
    Dim endTime As Date
    Dim numbTries As Integer
    Dim t As String
    numbTries = 100
    count = 0

    Dim dbs As DAO.Database
    Dim rsSQL As DAO.Recordset
    Dim strSQL As String
    Set dbs = CurrentDb

    'Open a snapshot-type Recordset based on an SQL statement
    strSQL = "Select FullName from ToolDesigners Where ToolDesignersID=4;"
    startTime = Now
    For count = 1 To numbTries
        Set rsSQL = dbs.OpenRecordset(strSQL, dbOpenSnapshot)
        t = rsSQL.Fields(0)
    Next count

    Dim mDiff As Double
    mDiff = DateDiff("s", startTime, Now)
    Debug.Print "SQL Total time:" & vbTab & DateDiff("s", startTime, Now)
    Debug.Print "SQL Average time:" & vbTab & mDiff / numbTries

    startTime = Now
    For count = 1 To numbTries
        t = DLookup("FullName", "ToolDesigners", "ToolDesignersID=4")
    Next count
    mDiff = DateDiff("s", startTime, Now)
    Debug.Print "DLookup Total time:" & vbTab & DateDiff("s", startTime, Now)
    Debug.Print "DLookup Average time:" & vbTab & mDiff / numbTries
End Sub
(I understand this is only precise to whole seconds.)
Interestingly, I found that each DLookup and each SQL query was taking nearly 0.5 seconds on average. Even on the company intranet I still see averages of over 0.10 seconds. The two approaches are very comparable in speed.
This causes very slow form refresh as well as VERY slow datasheet refresh.
I then tested against a SQLExpress database hosted on my machine - times dropped to 0.0005 seconds on average.
Question
It seems DLookups are slow in this application. I am hoping to find an alternative and faster approach.
What I would like is to somehow make the DLookup run against local tables that Access presumably keeps, rather than against the SQL tables on the server. It seems I could create temp tables every time I open a form or the database (not a fan of that); is there a better way?
It seems that if I were referring to another Access database I could just use OpenDatabase, which keeps it in memory and so speeds up queries against that database. But 100% of the examples I find refer to Access databases, not SQL.
Alternatively I could use something other than DLookup, which is what I thought when testing the SQL commands but I'm not really sure what to do because SQL was comparable speed.
If it's just single values then I'd be inclined to use a simple in-memory cache -
Private mToolDesignerFullNameCache As New Scripting.Dictionary

Function GetToolDesignerFullName(Criteria As String)
    If mToolDesignerFullNameCache.Exists(Criteria) Then
        GetToolDesignerFullName = mToolDesignerFullNameCache(Criteria)
    Else
        Dim Name
        Name = DLookup("FullName", "ToolDesigners", Criteria)
        mToolDesignerFullNameCache.Add Criteria, Name
        GetToolDesignerFullName = Name
    End If
End Function

Sub ResetToolDesignerFullNameCache()
    mToolDesignerFullNameCache.RemoveAll
End Sub
Requires adding 'Microsoft Scripting Runtime' as a VBA reference to compile. In the past I have found this sort of thing useful even when using an Access backend given how often the Access UI will poll for data.
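The pattern above is plain memoization keyed on the criteria string. For comparison, here is the same shape in Python; the lookup argument is a stand-in for DLookup (or any other slow backend call):

```python
_cache = {}


def cached_lookup(criteria, lookup):
    """Return lookup(criteria), hitting the backend only on a cache miss."""
    if criteria not in _cache:
        _cache[criteria] = lookup(criteria)  # the one slow round trip
    return _cache[criteria]


def reset_cache():
    """Invalidate everything, e.g. after the underlying table changes."""
    _cache.clear()
```

As with the VBA version, the cache must be reset whenever the underlying table can change, otherwise the form shows stale values.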

Is latency or my VPN choking my Excel to SQL Server uploads?

I am uploading data from Excel to SQL Server using the following structure:
Private Sub ado_upload()
Dim objConnection As New ADODB.Connection
Dim objCommand As New ADODB.Command
Dim strSQL As String
Dim strDSN As String
Dim intCounter As Integer
strDSN = "provider=SQLOLEDB;" _
& "server=<server>;" _
& "database=<database>;" _
& "uid=<uid>;pwd=<pwd>;" _
& "trusted_connection=false;"
With objConnection
.ConnectionString = strDSN
.Open
End With
strSQL = "SET NOCOUNT ON; " _
& "INSERT INTO dbo.[table1] ( [col1] ) VALUES ( ? );"
With objCommand
.ActiveConnection = objConnection
.CommandText = strSQL
.Prepared = True
.Parameters.Append .CreateParameter("col1", adInteger, adParamInput)
For intCounter = 0 To 9
.Parameters("Col1").Value = intCounter
.Execute
Next intCounter
End With
End Sub
The speed of the procedure depends on the geographic distance between the server and the computer running the procedure. On the server itself it is fast (300,000 inserts in under 10 minutes), on the other side of the country it is relatively slow (300,000 inserts could take hours). Remote uploads operate over a VPN.
I think network latency must be slowing the process down. Is there any way to get around network latency, or tweak the VPN to make the uploads faster?
Thanks!
The speed of the procedure depends on the geographic distance between the server and the computer running the procedure.
It really depends on how many hops you and the machine you're connecting to are away from the major trunks, and on the routes between them. It's not always the same route, either. Use the tracert command to see, because the bottleneck could be packets going through a 10 Mb/s link rather than an OC line.
On the server itself it is fast (300,000 inserts in under 10 minutes),
It should go without saying that performing something local to the machine will run at the fastest possible speed, outside of load on the machine at the given time.
...uploads operate over a VPN.
While it is secure, it's compounding the problem because the overhead of that security means you can send less per packet than if the data were unencrypted.
As I mentioned before, you don't have any control over the routes taken once your packets leave your network. All I can suggest is to upload the files to the server and then run the job from there.
Downloads are so much faster than uploads, even though it's still just an ADO Command. Is that because the data is being downloaded from SQL Server as binary data? Is there any way to achieve that same performance on the upload without uploading the file to the server first?
Depends on the connection & the contract terms. Residential internet is always an asymmetric connection - you'll have more download bandwidth than upload. The reason being that surfing/etc is download centric - you use very little bandwidth for uploads, like page requests and submitting forms. Until you want to upload files...
The only way to get better upload speeds is to get the better connection and/or contract terms.
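More generally for the upload direction: when latency dominates, the win comes from fewer round trips, not faster ones. The question's loop pays one network round trip per row; sending multi-row INSERT batches cuts that by orders of magnitude. Here is a sketch of the idea in Python; the table and column names mirror the question's dbo.table1(col1), and the default batch size reflects SQL Server's 1000-row cap on a multi-row VALUES list. The same trick works from ADO by building the batched INSERT text, or in pyodbc via executemany with fast_executemany.

```python
def batch_insert(conn, table, column, values, batch_size=1000):
    """Insert values using multi-row VALUES lists: one round trip
    per batch instead of one per row."""
    cur = conn.cursor()
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        placeholders = ", ".join("(?)" for _ in chunk)
        cur.execute(
            f"INSERT INTO {table} ({column}) VALUES {placeholders}",
            chunk,
        )
    conn.commit()


# e.g. over the question's connection (placeholder DSN/credentials):
# import pyodbc
# conn = pyodbc.connect("DSN=...;UID=...;PWD=...")
# batch_insert(conn, "dbo.table1", "col1", list(range(300000)))
```

With 300,000 rows this turns 300,000 round trips into 300, which is usually the difference between hours and minutes over a VPN.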